Validity and reliability

Posted on 17-12-2013

This is part two of my summary of Daniel Koretz’s book Measuring Up: What Educational Testing Really Tells Us. Part one, How useful are tests?, is here. Part three, Why teaching to the test is so bad, is here.


Koretz gives the clearest and fullest explanations I’ve read of what reliability and validity mean in the context of assessment. Here’s my attempt to summarise his points.


Reliability refers to the consistency of measurement. Reliable scores show little inconsistency from one measurement to the next. So for a test to be reliable it has to provide consistent results across time. If I sit a GCSE in English one day and I sit it again the next day (obviously having done nothing in between that might improve my score), then if the test is reliable I should get the same score. If I weigh myself on a set of bathroom scales several times in a row, then if the scales are reliable I should get the same reading every time.

A test can be reliable but inaccurate. The GCSE could give me the same score both times, but it could be the wrong score. The scales could tell me I am the same weight every time, but it could be the wrong weight.
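
To make the distinction concrete, here is a small Python sketch of the bathroom scales example. The true weight and all the readings are numbers I have made up for illustration; none of them come from Koretz. It measures two separate things: how much the readings vary from one weighing to the next (reliability) and how far their average is from the true weight (accuracy).

```python
from statistics import mean, pstdev

TRUE_WEIGHT_KG = 70.0  # hypothetical true weight, invented for this example

# Invented readings from two imaginary sets of scales.
consistent_but_wrong = [73.0, 73.1, 72.9, 73.0, 73.0]  # reliable, but inaccurate
right_on_average = [68.0, 72.5, 69.0, 71.5, 69.0]      # accurate on average, but unreliable


def summarise(readings, true_value):
    """Return (spread, bias) for a list of repeated readings.

    spread: standard deviation of the readings (low spread = reliable)
    bias:   mean reading minus the true value (near zero = accurate on average)
    """
    spread = pstdev(readings)
    bias = mean(readings) - true_value
    return spread, bias


for name, readings in [
    ("consistent but wrong", consistent_but_wrong),
    ("right on average", right_on_average),
]:
    spread, bias = summarise(readings, TRUE_WEIGHT_KG)
    print(f"{name}: spread = {spread:.2f} kg, bias = {bias:+.2f} kg")
```

The first set of scales barely varies but is about 3 kg out every time; the second is right on average but any single reading could be well off. Reliability only tells you about the first of those two numbers.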


Validity does not properly refer to the properties of a test, but to the inferences you make from a test. The same test can allow you to make valid inferences about one thing, but less valid inferences about another. Koretz’s first example of this is that an exam on statistics might enable you to make good inferences about someone’s statistical abilities but weaker inferences about their general mathematical abilities.

Koretz gives a couple more examples of how an assessment can support a valid inference in one area but not in another. He draws on his own experience of living and working in Israel without speaking Hebrew very well. He imagines that, during this time, he had taken the PET, the Israeli college admissions test. He would have ended up with an ‘appallingly low score’ because of his poor Hebrew.

‘What should the admissions officers have concluded about me based on my dismal scores? If they had inferred that I lacked the mathematical and other cognitive skills needed for university, they would have been wrong... Suppose the admissions officers wanted an answer to a similar question: whether, with the additional time and language study, I could be a competent student in a Hebrew-language university programme.’

Again, to infer from his poor test score that the answer was no would have been wrong. However,

‘suppose they had wanted to answer a third question: whether I was at that time and with the proficiency I had then, likely to be successful in Hebrew-language university study. In that case, my low score would have been right on the money: I would have been a weak student indeed.’

Validity can be compromised in many ways – Koretz explains three. First, in order to make a valid inference about a pupil’s ability in a particular domain, you have to make sure the test adequately samples from that domain. If it doesn’t, you have construct underrepresentation: your test fails to adequately represent the construct that you want to make inferences about. The opposite problem is also possible – when the test measures more than the thing you are trying to make an inference about. This makes it hard to isolate the exact skill you want to draw conclusions about. This is construct-irrelevant variance. Finally, there are cases when a test that normally allows us to make valid inferences is used in such a way that it compromises that validity – teaching to the test and the like.

Construct-irrelevant variance

As I said in the previous post, one of the foremost assessment experts in the 60s wanted tests to assess isolated and narrow skills and knowledge. For him, the advantage of this approach was that you knew exactly what you were testing. If you design a test that focusses very narrowly on subtraction with carrying, you end up with a score that gives you a highly valid inference about that pupil’s ability to subtract with carrying. These types of test have very low construct-irrelevant variance.

However, an obvious criticism of this kind of testing is that whilst you may have a very valid inference about a pupil’s ability to subtract with carrying, you can’t make any kind of valid inference about how well they can subtract with carrying in real-life contexts – which is, lest we forget, the entire point of education. People were also worried that narrow tests were leading to teaching that was also very narrow. So, in the 80s, assessment experts started adding in more ‘real-world’ type problems. In maths, this often meant word problems. But this brings back the problem that Lindquist identified to begin with: now the test is not just assessing maths ability, it is also assessing literacy. An EAL pupil who can subtract with carrying might fail this test. The test no longer gives us a valid inference about a pupil’s maths ability.

Now you might say that this is fine. In the end, a pupil has to use maths in the real world, so we should test them using more real-world problems. If they can’t use maths in the real world for whatever reason, tough. Having a valid inference about whether they can use maths in the real world is more valuable than having a valid inference about whether they can subtract with carrying. Perhaps.

But as Lindquist also noted, the problem with these more complex and real-world assessment items is not just that they change the type of inference you can make. They also make it much harder to identify the specifics of what a pupil is good or bad at. If a pupil does poorly on a ‘maths in the real world’ assessment, you have no idea whether their problem is poor literacy, lack of background knowledge, inability to subtract with carrying, or any combination of these. So more complex real-world assessment items might allow you to make an inference that is of more use in the real world, but they do not allow you to make inferences that help you to improve teaching and learning. The more real-world the problem, the less precise the inferences you can make and the less useful the score is for teachers and pupils. This is rather ironic given that the main motivation for those who wanted to introduce real-world assessments was to improve teaching and learning.

Construct underrepresentation

Certain kinds of test give more reliable scores. For example, multiple choice tests often give more reliable information than performance assessments. Koretz also points out, as I have been at pains to do lately, that multiple choice tests can be used to test more complex skills than is often thought. However, Koretz does also concede – as would I – that one of the skills that you really do need to test through performance assessments is writing. If you only test writing through multiple choice tests, you do get a reliable score, but you don’t necessarily get one that allows you to make valid inferences about a pupil’s writing skill. That’s because a multiple choice test of writing skill will suffer from construct underrepresentation.

Anyone who has been an English teacher will know the difficulties involved in devising and marking a test that does adequately represent the construct of writing. Marking extended writing is extremely complex and difficult to do reliably. Reliability doesn’t just mean that the same person will get a consistent score on different sittings of the test – it also means that if the same assessment was marked by several different markers, they would all give it the same mark.
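
As a rough illustration of that second sense of reliability, here is a short Python sketch that checks how closely several markers agree on the same essays. The marks and essay labels are entirely made up, and the two measures used (the gap between the highest and lowest mark, and the share of essays where every marker agrees exactly) are just simple choices of mine rather than anything Koretz prescribes.

```python
# Invented marks out of 30, given by three hypothetical markers to the same five essays.
marks_by_essay = {
    "essay_1": [18, 18, 18],
    "essay_2": [22, 19, 25],
    "essay_3": [12, 14, 12],
    "essay_4": [27, 27, 27],
    "essay_5": [15, 20, 17],
}


def mark_range(marks):
    """Gap between the highest and lowest mark given to one essay."""
    return max(marks) - min(marks)


def exact_agreement_rate(marks_by_essay):
    """Share of essays on which every marker gave exactly the same mark."""
    agreed = sum(1 for marks in marks_by_essay.values() if mark_range(marks) == 0)
    return agreed / len(marks_by_essay)


for essay, marks in marks_by_essay.items():
    print(f"{essay}: marks = {marks}, range = {mark_range(marks)}")

print(f"Exact agreement on {exact_agreement_rate(marks_by_essay):.0%} of essays")
```

In this invented data the markers agree exactly on only two of the five essays, and on one essay they are six marks apart. The wider those gaps, the less a pupil’s mark tells you about the essay and the more it tells you about who happened to mark it.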

In the case of testing writing, Koretz argues that we face ‘a difficult trade-off between reliability and construct-underrepresentation: to lessen the latter, one inadvertently lessens the former as well.’

Using tests in ways they shouldn’t be used

Because validity is not a property of a test, but of the inferences you make, even very good tests can have their validity compromised.  The most controversial way that this can happen is through gaming. Gaming does not have to be outright cheating, and Koretz devotes an entire chapter to the ways that this kind of gaming has happened in English schools. That will be the subject of my next blog post.


What are my conclusions from all this? From my experience of administering and marking complex performance-based assessments, I know how difficult it is to ensure that you mark them reliably and that they generate information you can use to make valid and useful inferences. I feel that Koretz explains exactly why these tasks are so difficult. I feel there is a good case for using multiple choice tests much more, as these can generate very reliable information and can be used to make valid inferences about more constructs than is commonly supposed. I would use complex performance-based assessments only where the relevant construct cannot be adequately tapped using other forms of assessment.

However, it is also the case that we live in an assessment-driven system. The types of assessment we select have an impact on the teaching that goes on in classrooms. We have to consider this when deciding how assessments should look. Koretz devotes a lot of time to this issue in his book, and as I say, it will be the subject of my next post.