Why teaching to the test is so bad

Posted on 19-01-2014

This is part three of my summary of Daniel Koretz’s book Measuring Up: What Educational Testing Really Tells Us. Part one was How useful are tests? and part two was Validity and reliability.

In my last post I spoke about how tests are only ever a proxy for what we want to measure. Koretz goes into great detail on this point and comes up with a brilliant analogy to help you understand it. Tests are about making a measurement, and generally, tests are trying to measure something huge. The technical term for what we are trying to measure is the domain. The domain that tests are trying to measure is the extent of a pupil’s knowledge and skills and their ability to apply these skills in the real world in the future. The domain is vast, and normally we have just a two- or three-hour exam to try to measure it. It’s for this reason that you often hear critics of tests say ‘but how can you measure everything a pupil can do in just one two-hour exam?’ And it is for this reason, of course, that tests are so daunting – I can vividly remember before one of my A-levels thinking how odd it was that all the work I had done over the last two years and all of the life I was to lead over the next three were going to be largely determined by what I did in one three-hour exam. Often, critics move from noting this point to dismissing tests altogether – as Peter Tait seems to do at the end of a recent article in the Telegraph, where he quotes a famous A. S. Neill line about the limitations of tests.

If we have to have an exam at eleven, let us make it one for humour, sincerity, imagination, character – and where is the examiner who could test such qualities?

It is correct to note that tests can never measure an entire domain. But it is not correct to dismiss tests on this basis. Tests can only ever sample the domain they measure. But thanks to statistical techniques and the skill of test-setters, this sample can provide a fairly reliable proxy of the entire domain. Koretz is also keen to point out, contra Neill, that exams can measure things that are of value: ‘the evidence shows that standardized tests can measure a great deal that is of value.’ Koretz’s analogy here is with opinion polling. Opinion polls are trying to measure vast domains. In Britain, opinion pollsters are trying to work out the voting intentions of 40 million voters. They do so on the basis of a sample of 1,000 voters. On the face of it, this seems ridiculous. How on earth can you poll 1,000 people out of a population of 40 million and hope to get any kind of accurate measure of the 40 million? But of course, you can. Opinion pollsters use statistical techniques to ensure that their sample is a representative one, and as a result, more often than not they are right. Of course, they aren’t always right, just as exam results aren’t always right. It certainly is less accurate to measure a sample than it is to measure the entire domain. But done well, opinion polls and exams can both be extremely accurate. And where there is inaccuracy or uncertainty, we have methods of measuring those too. On the face of it, it seems absurd that 1,000 people can tell us about 40 million, and that two hours can tell us about a person’s skills and knowledge. But if the poll or test is designed carefully enough, then this small sample is enough to tell us quite a lot. Of course, careful design in both cases involves a lot of complexity. Before the 2012 US election, there were lots of arguments about the correct way of polling and about Nate Silver’s particularly uncanny methods of aggregating polls. And the recent PISA tests came in for criticism about their statistical complexities. Koretz’s argument is that these complexities are important and can’t just be dismissed.
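
To put a rough number on the polling analogy, here is a minimal sketch that is not taken from Koretz’s book. It assumes a simple random sample, which is a simplification of what real pollsters do, and uses the textbook 95% margin of error for a proportion. The function name and its parameters are my own illustrative choices.

```python
import math

def margin_of_error(sample_size, proportion=0.5, z=1.96, population=None):
    """Approximate 95% margin of error for a proportion from a simple random sample.

    proportion=0.5 is the worst case; z=1.96 is the usual 95% multiplier.
    If population is given, apply the finite population correction.
    """
    standard_error = math.sqrt(proportion * (1 - proportion) / sample_size)
    if population is not None:
        standard_error *= math.sqrt((population - sample_size) / (population - 1))
    return z * standard_error

# 1,000 voters, ignoring the size of the electorate
moe = margin_of_error(1_000)

# The same sample, taking the 40 million electorate into account
moe_with_population = margin_of_error(1_000, population=40_000_000)

print(f"{moe:.3f}")
print(f"{moe_with_population:.3f}")
```

Under these assumptions, the margin comes out at about three percentage points either way, and the size of the electorate makes almost no difference. What matters is the size and representativeness of the sample, which is why 1,000 people can tell us something about 40 million.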

Many people simply dismiss these complexities, treating them as unimportant precisely because they seem technical and esoteric…This proclivity to associate the arcane with the unimportant is both ludicrous and pernicious. When we are ill, most of us rely on medical treatments that reflect complex, esoteric knowledge of all manner of physiological and biomedical processes that very few of us understand well. Yet few of us would tell our doctors, that their knowledge, or that of the biomedical researchers who designed the drugs we take, can’t possibly be important because to our uninformed ears it is arcane.

In the case of opinion polls, it would be prohibitively expensive and impractical to survey the entire electorate every week or so. However, eventually the entire electorate is measured, in the form of a general election. In the case of exams, it is essentially impossible to measure the entire domain.

If we wanted to know whether schools successfully imparted the skills and dispositions needed to use algebra successfully in later work, we would go observe students later in life to see whether they used algebra when appropriate and whether they were successful in their applications of it.

There are clearly all kinds of practical problems with this. You’d only get a measure years after pupils had finished school, it would be hard to standardise such tests and it would be hugely expensive. And some things you might want to measure would be hard to observe. The result is that we need standardized tests which elicit a certain behaviour and which are the same for everyone. These tests are proxies or samples for that far wider domain which is essentially impossible to measure on its own. So, Koretz’s point is that tests are only ever a sample of the wider domain. There is no final ‘general election’ which will give us the actual measure of the domain. We can only ever sample from the domain, and then use that sample to make an inference about the wider domain. We can never measure the domain in the way we can measure a table or weigh some ingredients. However, thanks to the power of statistics and the accumulated research we have about exams, these samples can allow us to make very valid inferences about the domain.

But what happens if, for some reason, the sample stops being a good proxy for the wider domain? This is of course why cheating is such a problem for test-setters. If a pupil can get hold of the test questions in advance and work out the answers, then they can score a top mark on the test without having the mastery of the domain that a top mark is meant to imply. This isn’t just true of educational assessment. Koretz gives lots of real examples of situations where knowing the sample that is used for making inferences has led to those inferences being compromised. For example, the US Postal Service uses a random sample of 1,000 addresses to check the speed of postal deliveries. Some workers found out the sample addresses and made sure that those addresses received a very speedy delivery. Those addresses did indeed receive a very speedy delivery, but the inference it allowed you to make about the entire domain was compromised. So far, I guess a lot of these implications are fairly obvious. We all know that if someone gets hold of a test paper in advance, their scores on it are no longer valid. And we are all aware that tests can’t cover everything. That’s why they need to be secret – so that you end up learning and revising everything because you don’t know just which bit will turn up on the exam.
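
The postal example can be sketched as a small simulation. Everything in it is an assumption made for illustration: the 100,000 addresses, the delivery times of one to five days, and the variable names are all hypothetical, and none of the figures come from Koretz.

```python
import random

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Hypothetical domain: delivery time in days for each of 100,000 addresses
delivery_days = [rng.choice([1, 2, 3, 4, 5]) for _ in range(100_000)]

# The inspectors' random sample of 1,000 addresses
sample_ids = rng.sample(range(len(delivery_days)), 1_000)

def mean(values):
    return sum(values) / len(values)

true_mean = mean(delivery_days)
honest_estimate = mean([delivery_days[i] for i in sample_ids])

# Gaming: workers learn the sampled addresses and rush only those deliveries
gamed_days = list(delivery_days)
for i in sample_ids:
    gamed_days[i] = 1

gamed_estimate = mean([gamed_days[i] for i in sample_ids])
gamed_true_mean = mean(gamed_days)

print(f"honest sample estimate: {honest_estimate:.2f} (true mean {true_mean:.2f})")
print(f"gamed sample estimate:  {gamed_estimate:.2f} (true mean {gamed_true_mean:.2f})")
```

In this toy version the honest sample lands close to the true average, while the gamed sample reports next-day delivery even though the service as a whole has barely changed. The measurement on the sampled addresses is accurate; it is the inference about everyone else that has broken.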

However, there are two particular ways in which I think Koretz goes even further than this, and in so doing makes an argument that challenges a lot of very common teaching practice. First, the point that Koretz is making with the above analogies and example is that the domain is vast. It isn’t just that the domain is bigger than the test – that much, as I have said, is obvious. It’s that the domain is even bigger than the syllabus. In fact, the domain is even bigger than the school curriculum. Second, the point Koretz is making with the postal service example is that if you teach to the test, then a pupil may well genuinely improve on those test items. But the point of a test score is not actually to tell you how well pupils have done on those particular items. The point of a test score is to allow you to make an inference about the wider domain. And the implication of this is that it isn’t just outright cheating which compromises the validity of the inferences we make from tests. Certain kinds of test preparation compromise validity too. Here is the logic: the test is a sample from the domain. The syllabus is also a sample from the domain. In order for the test to provide a valid inference about how a pupil will perform on the entire domain, teaching must be geared towards the domain. If teaching is geared towards the test, that compromises the result. But even if teaching is geared towards the syllabus, that can compromise the result too.

Koretz gives some examples of how teaching to the test and syllabus, whilst not cheating, nevertheless results in inflated exam scores. He describes seven common reactions to the introduction of high-stakes tests.

1. Working more effectively
2. Teaching more
3. Working harder
4. Reallocation
5. Alignment
6. Coaching
7. Cheating

Koretz says that 1-3 result in genuine gains, 7 always results in false gains and 4-6 can result in either genuine or false gains depending on how they are used. Reallocating time generates false gains if you are taking time away from things that are also an important part of the domain. Alignment is when you match the teaching to the test syllabus. ‘Coaching refers to focussing instruction on small details of the test, many of which have no substantive meaning.’

I think all three of these tactics are used in the most damaging ways in English schools. When it comes to coaching, I have often noticed how pupils who cannot tell you one date from history are able to tell you the number of marks available for each question on a history exam paper. This is perhaps one reason why memorisation gets such a bad name. Memorisation of the right things – for example, times tables and verb tables – is extraordinarily valuable, but memorisation of the wrong things – for example, marks-to-minutes ratios and exam cheats and hints – is clearly not.

In terms of alignment and reallocation, here are some examples from my own experience. When I studied A-level history, one of the modules was the Russian Revolution. The syllabus made no mention of medieval absolutism and the difference in land distribution in early modern England and Russia. Yet the very first lesson I had on Russian history was on these topics, and these topics were enormously helpful in getting me to understand the part of Russian history that was actually in the syllabus.

I think that if that teacher’s scheme of work were analysed by a headteacher or senior leader keen to improve grades, then those introductory lessons on medieval and early modern Europe would be the first to be ‘reallocated’ or ‘aligned’ out of existence. I worry that even without a headteacher or senior leader getting involved, the pressure to do well on the exam might lead to the teacher making those cuts herself. And I worry that making those cuts in order to focus on ‘exam skills’ might even improve results on the exam even as it reduced pupils’ wider understanding of the domain – i.e., even as it reduced the validity of the inference we can make from the exam result.

Another example: when I studied A-level history, one of the modules was on the Vietnam War. Essentially, we worked out from looking at past papers and the syllabus that there were basically six essay questions you could get in the final exam. There were two essays each year, and you could make a fair bet that they wouldn’t repeat the two from the previous year. There was one question that hadn’t been asked at all, so you felt confident that that would come up. That left three other essay questions from which you felt fairly sure the second would come. Now, this kind of analysis was something we did at the very end of the course after we’d learnt all the material. But it would have been entirely possible for the teacher to have done that analysis at the start and then just taught us the answers to those four essay questions. In fact, it would have been entirely possible for her to have given us model answers to those four questions and told us to go away and learn them. We could have done that and got very, very good grades on the exam without cheating. It would also have taken a lot less time than actually learning about the Vietnam War. The gains from this approach would have been particularly valuable for weaker students.
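
The arithmetic behind that bet can be checked in a few lines. This is only a sketch of the reasoning above: the question labels are hypothetical placeholders, and it assumes, as we did at the time, that the examiners never repeat either of the previous year’s two essays.

```python
from itertools import combinations

# Hypothetical labels for the six essay questions we identified
questions = ["Q1", "Q2", "Q3", "Q4", "Q5", "Q6"]

# Assumption: these two came up last year and will not be repeated
last_year = {"Q1", "Q2"}

# Questions that could appear this year
candidates = [q for q in questions if q not in last_year]

# Every possible pair of essays the exam could set
possible_papers = list(combinations(candidates, 2))

# A pupil who memorises model answers to the four candidates
memorised = set(candidates)
fully_covered = all(set(paper) <= memorised for paper in possible_papers)

# How much of even this narrow pool of six questions was actually studied
share_of_pool = len(memorised) / len(questions)

print(len(candidates), len(possible_papers), fully_covered, round(share_of_pool, 2))
```

If the assumption holds, four memorised answers cover every paper the examiners could set, while covering only two thirds of a question pool that is itself a tiny slice of the domain.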

I think that these kinds of tactics go on in schools now, and I also think that in some cases they are even recommended and praised. I think one of the reasons why such tactics are praised is they involve the teacher and the student working very hard. It’s not easy to write model answers, and it isn’t easy to memorise them. Clearly, such a tactic is not cheating. But it does completely compromise the validity of the test. Koretz’s argument is that because tests can only ever be samples of the domain, there is no possibility of an optimal test. There is no way we can measure the domain. Hence, all tests will be to some extent imperfect, and if a teacher tries hard to game them, they will be able to. I accept this. This is one way in which Koretz has genuinely changed my mind. I used to think that the problem of excessive test prep was one of badly designed tests. That is, I used to think that it was OK to teach to the test if the test was worth teaching to. Koretz takes on this exact point and shows that it is false. He’s convinced me. Even the best test in the world is only a sample, and samples can be gamed.

However, there is one aspect where I would depart from him. Whilst I accept that all tests are to some extent imperfect, there are clearly varying degrees of imperfection. Some of the tests Koretz describes are so flawed that they are crying out to be gamed. For example, he gives an example of one teacher who didn’t teach her pupils about irregular polygons because they never came up on the test. Only regular polygons appeared on the test. Koretz is highly critical of the teacher for this. I see his point, but I think it is also the case that we can blame the test-setter in this case for not including irregular polygons. Including irregular polygons surely can’t be that much of a burden for the test-setters. I accept that as a profession, we have to move away from teaching to the test, and that policymakers also have to stop encouraging it. But I think that assessment experts and test designers like Koretz must also design exams that are difficult to game.

A couple of years ago, I was at a conference where a deputy head of a successful school presented the amazing exam results his school had achieved over the past few years, and outlined some of the ways he’d achieved them. In the question and answer session afterwards, I asked how he could be sure that these results were down to improved teaching and learning and not just teaching to the test. His reply was that all of his teachers and students were working harder, and that it was so much easier nowadays for pupils and teachers to download past papers from the internet and work on their areas of weakness. I wasn’t totally convinced by this argument, but at the time it did seem to be making a fairly good point, which I’ve since heard many other people make: ‘pupils nowadays have better resources which they can access more quickly than in the past and this contributes to them learning more and becoming smarter.’ But having read Koretz, I no longer think this argument stands up at all. Teaching which focusses on past papers and test prep is not teaching to the domain. It’s teaching to the sample. The improvements it generates are not likely to be genuine.

What is also particularly worrying about this example is that even when asked to identify a form of teaching and learning which was not test prep, this deputy referred to methods of test prep. Teaching to the test has become teaching and learning. It is hard for many people to have a model of improved teaching and learning which is not teaching to the test. It is for this reason that I am so keen on schools using a rigorous and detailed curriculum. The curriculum is not as wide as the domain either, but it is much wider than either the syllabus or the exam. Unlike the syllabus and exam, it is designed with teaching and learning in mind, not assessment. As Tim Oates has shown, and as I have tried to argue elsewhere, if you have a vague curriculum then the result is not ‘teacher freedom’. The result is that the syllabus and/or the test become the curriculum, with hugely damaging consequences.