Even reliable assessments can be biased
Posted on 20-09-2019
Imagine a class of 20 students, ten boys, ten girls. They all sit a maths exam which is graded from 0 – 9. On average, the boys and girls do equally well. The average grade of all the pupils is 4.5, and that is also the average of the sub group of boys, and of the sub group of the girls.
Now let’s imagine that after the students have taken this test, but before they get their results back, their teacher grades these pupils on the basis of their maths classwork and participation throughout the year. Here’s a chart showing the teacher assessment grades and the test grades.
We can see from this chart that the teacher assessments and tests are in close agreement. The overall average grade is the same for both. There’s never more than a one grade difference between an individual’s test grade and their teacher assessment grade. When we consider the measurement error inherent in any test, that’s a pretty good outcome. And if we correlate the two sets of data, we find an incredibly high correlation between them of 0.93. So surely everything is fine? Not so fast. Let’s look at the same graph, but with the difference between the boys and girls highlighted.
When we look at it like this, we can see that the teacher assessment gives us a completely different understanding of the performance of the two sub groups. We are now seeing a two-grade difference on average between the boys and the girls! The girls now average 3.5, and the boys a massive 5.5! Imagine that the cut off for further study of maths was a grade 5 or above. The test grade would lead to a 50-50 gender split for further maths study. The teacher assessment (TA) grade, by contrast, would lead to a 60-40 split in favour of the boys.
This analysis shows us that it is perfectly possible for two separate scores to be very highly correlated, but for one of them still to be significantly biased. In this case, it looks as though the teacher assessment is suffering from gender bias – which, depressing as it seems, is not that surprising. There is an extensive literature on bias in teacher assessment, which shows, amongst other things, that teachers are biased in favour of boys in maths assessments, in favour of girls in English, and against pupils from low-income backgrounds, pupils with special educational needs (SEN), and pupils from some ethnic minority backgrounds.
However, if we only ever looked at the correlation between test scores and teacher assessment, we’d never spot this bias. In order to see if two sets of data disagree in this way, we have to dig further and compare the averages of individual sub groups on each measure too.
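To make that check concrete, here is a minimal Python sketch. The grades in it are invented for illustration and are not the data behind the charts above: I have assumed that the boys and girls get identical test grades, and that the teacher marks every boy one grade up and every girl one grade down. The helper names (`pearson`, `group_mean`, `passes`) are my own. With these made-up numbers the correlation comes out at roughly 0.9, yet the sub group averages and the grade 5 cut off tell a very different story.

```python
from statistics import mean

# Hypothetical test grades (0-9), invented for illustration only.
# Assumption: boys and girls have exactly the same test grades.
test_grades = [1, 2, 3, 3, 4, 5, 6, 6, 7, 8]

# Each pupil is (group, test grade, teacher assessment grade).
# Assumption: the teacher marks boys one grade up and girls one grade down.
pupils = (
    [("boy", t, t + 1) for t in test_grades]
    + [("girl", t, t - 1) for t in test_grades]
)
TEST, TA = 1, 2  # column positions within each pupil tuple


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def group_mean(group, column):
    """Average grade for one sub group on one measure."""
    return mean(p[column] for p in pupils if p[0] == group)


def passes(group, column, cutoff=5):
    """Number of pupils in a sub group at or above the cut off."""
    return sum(1 for p in pupils if p[0] == group and p[column] >= cutoff)


# Step 1: the headline correlation looks reassuring.
r = pearson([p[TEST] for p in pupils], [p[TA] for p in pupils])
print(f"correlation between test and TA: {r:.2f}")

# Step 2: the sub group comparison exposes the bias.
for group in ("boy", "girl"):
    print(
        f"{group}s: test mean {group_mean(group, TEST)}, "
        f"TA mean {group_mean(group, TA)}, "
        f"pass on test {passes(group, TEST)}, "
        f"pass on TA {passes(group, TA)}"
    )
```

The correlation stays high because the teacher's adjustment is small compared with the spread of grades across the class. It is only when we split the averages by sub group that the two-grade gap shows up.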
Why does this matter? Recently, the Journal of Child Psychology and Psychiatry published a paper arguing for the introduction of teacher assessment in place of tests. One of its chief pieces of evidence was that teacher assessment can be trusted to be accurate because it correlates very highly with test scores. However, the paper barely discusses the issue of bias. There is just one reference to it, right at the end of the paper, where it says ‘Although teacher assessments do not always accurately reflect students’ ability because of biases and stereotypes held by teachers, our results show teacher assessments and exam scores are highly correlated both contemporaneously and over time.’ This statement particularly worries me as it implies that a high correlation is the same as an absence of bias. But as we’ve just seen, this is absolutely not the case. And that is the only mention of bias in the paper: none of the results or data address the issue at all.
If the underlying data really do show no bias in TA, that is a hugely significant finding, one that flies in the face of decades of research showing the opposite, including some research carried out on very similar data sets. Burgess and Greaves in 2013, and Campbell in 2015, have compared data from test scores and teacher assessment at KS1, KS2 and KS3 and found significant evidence of bias.
I’ve seen some suggestions that the data in this paper proves that the English assessment system could easily move from tests to teacher assessment. I don’t think it does: it shows that teacher assessment is correlated with tests, but we knew that already. And it offers only assertions, not data, on the key issue of bias in teacher assessment.
For the record, I think that teachers do have an incredibly important role to play in assessment, and I don’t think that national assessment systems currently make the most of them. At No More Marking, where I work, we have thought very carefully about creating assessments that use the expertise of serving teachers without introducing bias, and we’ve come up with an assessment structure which we think achieves this aim. If we really want to increase the role of teachers in assessment, we need to acknowledge the existence of bias and look at ways to work around it, not ignore it or assume that a high correlation means bias cannot exist.
Rich Davies helped with comments & graphs – thanks Rich!