Marking essays and poisoning dogs

This psychological experiment asked participants to judge the following actions.

(1) Stealing a towel from a hotel
(2) Keeping a dime you find on the ground
(3) Poisoning a barking dog

They had to give each action a mark out of 10 according to how immoral it was, on a scale where 1 is not particularly bad or wrong and 10 is extremely evil.

A second group were asked to do the same, but they were given the following three actions to judge.

(1”) Testifying falsely for pay
(2”) Using guns on striking workers
(3”) Poisoning a barking dog

I am sure you can guess the point of this. Item (3) and item (3”) are identical, and yet the two groups consistently differ in their ratings of it. The latter group judge the action to be less evil than the former group do. The reason is not hard to see: when you are thinking in terms of stealing towels and dimes, poisoning a barking dog seems heinous, but in the context of killing humans, it seems less so. The same principle has been observed in many other fields, and has led many psychologists to conclude that human judgment is essentially comparative, not absolute.

There is a fantastic game on the No More Marking website which demonstrates exactly the same point. I recommend clicking on the image below, which links to the game, and playing it right now, as it will illustrate the point better than any words can.

[Screenshot: image linking to the No More Marking shades-of-blue game]

In brief, the game asks you to look at 8 different shades of blue individually and rank them from light to dark. It then asks you to make a series of comparisons between shades of blue. Everyone is much better at the latter than at the former. The No More Marking website also includes a bibliography about this issue.

Hopefully, you can also see the applications of this to assessment. This is one of the reasons why we need to define absolute criteria with comparative examples.

The interesting thing to note about the ‘evil’ and ‘blue’ examples is that the criteria are not that complex. One does not need special training or qualifications to be able to tell light blue from dark blue. The final judgment is one that everyone would agree with. Similarly, whilst judging evil is morally complex, it is not technically complex – everyone knows what it means. And yet, even in cases where the criteria are so clear, and so well understood, we still struggle. Imagine how much more we will struggle when the criteria are technically complex, as they are in exams.  If we aren’t very accurate when asked to judge evil or blue in absolute terms, what will we be like when asked to judge how well pupils have analysed a text? The other thing this suggests is that learning more about a particular topic, and learning more about how pupils respond to it, will not of itself make you a better marker. You could have great expertise and experience in teaching and reading essays on a certain topic, but if you continue to mark them in this absolute way, you will still struggle. Expertise in a topic, and experience in marking essays on that topic, are necessary but not sufficient conditions of being an accurate marker. However expert we are in a topic, we need comparative examples to guide us.

Unfortunately, over the last few years, the idea that we can judge work absolutely has become very popular. Pupils’ essays are ticked off against APP grids or mark schemes, and if they tick enough of the statements, then that means they have achieved a certain grade. But as we have seen, this approach is open to so much interpretation. Our brains are just not equipped to make judgments in this way. I also worry that such an approach has a negative impact on teaching, as teachers teach towards the mark scheme and pupils produce essays which start to sound like the mark scheme itself. Instead, what we need to do is to take a series of essays and compare them against each other. This is at the heart of the No More Marking approach, which also has the benefit of making marking less time-consuming.  If you aren’t ready to adopt No More Marking, you can still get some of the benefits of this approach by changing the way you mark and think about marking. Instead of judging essays against criteria, compare them to other essays.  Comparisons are at the heart of human judgment.
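In case it helps to see the mechanics, here is a minimal sketch in Python of how a pile of pairwise “X is better than Y” judgments can be turned into a rank order. It uses a simple fixed-point iteration for the Bradley–Terry model, a standard statistical model behind comparative judgement. The essay names and judgment data are invented for illustration, and I am not claiming this is the exact algorithm No More Marking uses.

```python
from collections import defaultdict

def bradley_terry(judgements, iterations=100):
    """Estimate a quality score for each item from pairwise judgements.

    judgements: list of (winner, loser) pairs, where the winner was
    judged the better of the two. Returns a dict mapping each item to a
    strength score; higher strength means judged better overall.
    """
    wins = defaultdict(int)         # total wins per item
    pair_counts = defaultdict(int)  # comparisons per unordered pair
    items = set()
    for winner, loser in judgements:
        items.update((winner, loser))
        wins[winner] += 1
        pair_counts[frozenset((winner, loser))] += 1

    strength = {i: 1.0 for i in items}
    for _ in range(iterations):
        new = {}
        for i in items:
            # Sum expected comparison "pressure" from every opponent j.
            denom = 0.0
            for j in items:
                if j == i:
                    continue
                pair = frozenset((i, j))
                if pair in pair_counts:
                    denom += pair_counts[pair] / (strength[i] + strength[j])
            new[i] = wins[i] / denom if denom else strength[i]
        # Rescale so the scores stay comparable across iterations.
        total = sum(new.values()) or 1.0
        strength = {i: s * len(items) / total for i, s in new.items()}
    return strength

# Hypothetical judgements between three essays:
judgements = [
    ("Essay A", "Essay B"), ("Essay A", "Essay B"),
    ("Essay B", "Essay C"), ("Essay A", "Essay C"),
]
scores = bradley_terry(judgements)
ranking = sorted(scores, key=scores.get, reverse=True)
```

Notice that no judge ever assigns an absolute mark: each judgment is just “this one is better than that one”, which is exactly the kind of decision the blue-shades game shows we are good at. The scale emerges from the pattern of comparisons.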

I am grateful to Chris Wheadon from No More Marking for talking me through how his approach works. No More Marking are running a trial with FFT and Education Datalab which will explore how their approach can be used to measure progress across secondary. See here for more detail.

As an interesting aside, one of the seminal works in the research on the limitations of human judgment is George Miller’s paper The Magical Number Seven. I knew of this paper in a different context: it is also a seminal work in the field of working memory, and the limitations of working memory. Miller also wrote the excellent, and very practical, article ‘How Children Learn Words’, which shows how looking words up in dictionaries and other reference sources may not be the best strategy for learning new vocabulary. I’ve written a bit about this here.

My last few posts have all been about the flaws in using criteria, and alternatives to using them. In my last post, I mentioned that there were two pieces of theory which had influenced my thinking on this: research on human judgment, and on tacit knowledge. This blog post has looked at the research on human judgment, and in my next, I will consider how the idea of tacit knowledge also casts doubt on the use of criteria.

6 responses to “Marking essays and poisoning dogs”

  1. Dick Schutz says:

    This has a lot to say about human judgement in general and about scoring/grading essay exams/“constructed responses” in particular. But it has (almost) nothing to say about “criteria” in general or about communicating the results of achievement test scores in particular.

    Some schooling matters CAN be judged absolutely–matters of grammar, reading accuracy and speed, computational accuracy, scientific nomenclature, and so on. 2+2=4; “dog” is pronounced dog and not god, and so on. The confusion arises in the psychometry of Item Response Theory, which goes back 100 years to E. L. Thorndike’s Credo: “Whatever exists at all exists in some amount.” Item Response Theory subsequently evolved to produce measures of the “whatevers.” In Thorndike’s time the “whatevers” were viewed as “traits/faculty”; today they are commonly viewed as “abilities/skill,” but the referent is the same–“something” that is within the individual.

    With IRT and computers, “achievement tests” can be churned out routinely and scaled numerically. “No problems” up to this point, but now the problems begin.

    The problems are (at least) two-fold. One, the scaled numbers per se are meaningless. As Dylan Wiliam noted in commenting on an earlier blog in the series, the “whatever/ability” exists only in the heads of the community that agrees on its meaning/definition. That is, the “measurement” does not instantiate (bring into existence) the “whatever.” Not everything that is IRT-measured exists, and some things that aren’t IRT-measured are important. Whether or not these important matters can be instructed is an empirical matter.

    The second and more consequential problem is that the IRT-derived “measurement” provides no information about the instruction that was conducted–intended to reliably instantiate the “whatever.” Since the purpose of the testing is to shed light on and to improve instruction, this is a “big problem.” It gives rise to calls for “no marks, no tests.” But discarding measurement doesn’t resolve the “problem”–the black box of instruction remains a black box.

    Formerly, the UK DfE reported achievement tests in terms of “Levels.” That practice has been discarded (which in my view was a move in the right direction, but in any case, it’s a done deal). Beginning in 2016, results will be reported in terms of scaled scores, but the replacement for “Levels” as the means of interpreting the results hasn’t been disclosed. If the protocol in common use in the US is followed, all the shortcomings of “Levels” will still be there. I don’t know of an alternative protocol, but there may be one.

    There’s a measurement alternative to the prevailing psychometric view, but that’s a whole nother story.

  2. julietgreen says:

    I’m very glad to see the emergence of increasing challenge to the notion of teacher assessment being in any way ‘reliable’. I have fought for years in my own establishment to try to undermine people’s fast-held belief that it can be, and that moderation can make it ‘more so’. Moderation is just another word for ‘group effect’. At its worst, it can be devastating – humans are capable of all kinds of unpleasantness when they get together and support each other’s beliefs.

    It was many years ago that I came across the idea that we are much better (and more likely to be consistent) at ranking pupils’ work relatively than we could ever be at using external criteria such as APP. In spite of this, I was never able to convince any colleagues that we should at least begin in this way. We have had years of undervaluing pupils’ writing, through fear of being seen to ‘inflate’ their levels. Unsurprisingly, our pupils always achieved much higher levels on externally marked, standardised tests.

    Dick, like you, I also welcomed the demise of levels, for the reasons I’ve covered above. However, we are now working within a system for which we currently have no yardstick. We’re required to measure, but we’re measuring something with smoke. At this point, I’m ready to welcome a return to levels, like a post-colonial raising the flag of the old regime.

  3. I disagree with Dick (and possibly with Dylan) that “the ‘whatever/ability’ exists only in the heads of the community that agrees on its meaning/definition”. Yes, the community consensus is vital to the business of consistent perception and practice. Yes, the internal disposition is not directly observable. But it does not follow that the internal disposition does not exist. Some people can be shown to have higher likelihoods of producing certain sorts of performance than others. Like much of the rest of the real world, we do not ultimately understand the exact nature of its reality – but we do know that it exists, because otherwise the patterns that we observe would be mere coincidence – and that is inconceivably unlikely. The use of the term “construct” by Dylan and Tim Oates is unhelpful for this reason – it suggests human dispositions are nothing more than hypotheses.

    It is essential that we agree that ability/competency/disposition/capability – the “whatever” – really exists, because we need to understand that questions of assessment and definition are secondary, operational questions. It is vital that we avoid the siren calls of soft relativism. Daisy’s argument for comparative assessment is about the process of assessment and not the meaning of what is being assessed. In spite of our difficulty in perceiving it, some shades of blue really are darker than others.

    I agree with Juliet’s warning about moderation as “group effect” – I have argued this case in respect of the supposed “wisdom of the crowd”. But if we are to achieve consistent understandings across a large community, some form of social networking is required. Comparative judgments may be a useful way for a single person to make up their own mind about what good looks like – but I don’t think it explains how those judgments are to be aligned with each other. At first sight it makes this second task more difficult, because however inconsistent or reductive mark-schemes might be, they do at least provide one authoritative document to which everyone can refer.

    If we are going to use exemplars rather than mark-schemes, it is essential that the exemplar really is an exemplar of something else – the “whatever” – and does not become a reference that everyone will imitate, just as they used to write answers that mimicked the wording of the mark-scheme. If you then produce multiple exemplars, you have the problem of ensuring that they are consistent, one with another.

    It seems to me that this argument is going in the right direction but has quite a way still to go. And it seems to me that the missing piece will be provided by digitally-driven analytics systems that will correlate both student performance and teacher judgments, ensuring a body of exemplars that consistently illustrate the same learning objective. These exemplars can then be relied on by assessors, through a process of comparative judgement, supporting a consistent understanding of the learning objective, without the exemplars themselves becoming the learning objective.

  4. […] one of the problems with the kinds of history exams Tony Little rightly criticised is that they are too dependent on human judgment against abstract criteria, which we know is a very ineffective method of assessment.  I would like to see an element of […]

  5. […] The recent decision to abandon criterion referencing, in the form of levels, illustrates the point that individual criteria or definitions do not necessarily improve the consistent understanding of the intervals on a scale. It is actually quite hard to create short rubrics that define performance criteria reliably. The two ends of the scale (“not at all” and “completely”) can certainly be defined precisely—but the intermediate points are much more challenging. Imagining a smooth progression is probably the most consistent way of defining the intermediate intervals. See also Daisy Christodoulou’s recent arguments on the benefits of comparative marking. […]

  6. […] first came upon the idea on Daisy Christodoulou’s Wing of Heaven blog about assessment, and the idea seemed appealing. I downloaded a few academic papers about the […]
