On validity and reliability

A question we are often asked concerns the reliability data shown in the standard reports. To understand how useful this number is, it is worth stepping back to consider reliability as a concept, and its relationship to validity in the design of assessment sessions.

The search for legitimate conclusions

An important starting point is to understand that validity is not a property of a test or of test scores. Rather, it is a property of the conclusions we draw from test outcomes. A conclusion might, for example, be that "students at the top of a rank are better than students at the bottom of a rank on competency X". Validity expresses the legitimacy of this conclusion.

The level of reliability is a key metric in understanding this legitimacy. For example, if the result for a student for a certain area of competence is different tomorrow from what it was today, then any conclusions we draw cannot be warranted. The reliability data in an RM Compare session helps us to understand this.
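The "today versus tomorrow" idea can be made concrete with a small sketch. It correlates two sets of scores for the same students, which is one common way of quantifying stability over time. The scores and the function name `pearson_correlation` are hypothetical, chosen only for illustration, and this is not a description of how RM Compare calculates the reliability figure in its reports.

```python
from math import sqrt


def pearson_correlation(xs, ys):
    """Pearson correlation between two equal-length lists of scores."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least two scores")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    if spread_x == 0 or spread_y == 0:
        raise ValueError("scores must vary for a correlation to be defined")
    return covariance / sqrt(spread_x * spread_y)


# Hypothetical scores for five students on the same competency, two days running.
today = [52, 61, 70, 78, 90]
stable_tomorrow = [54, 60, 71, 77, 91]    # same rank order as today
unstable_tomorrow = [75, 50, 88, 58, 69]  # rank order scrambled

stable = pearson_correlation(today, stable_tomorrow)
unstable = pearson_correlation(today, unstable_tomorrow)

print(f"stable retest:   {stable:.2f}")
print(f"unstable retest: {unstable:.2f}")
```

A correlation close to 1 means the rank order held from one day to the next, so a conclusion such as "students at the top are better than students at the bottom" has some support. A correlation near zero means the same conclusion would rest mostly on chance.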

In this regard, reliability is only useful when it is treated as part of validity.

Reliability as a part of validity

We want our student scores to reflect differences in the things we are interested in (the constructs). If scores vary for any other reason, that variation is irrelevant to the construct of interest.

This variation could be systematic (for example, poor readers struggling to understand questions on a maths test) or it could be due to random influences. As we have written about in previous posts, random variation is frequently caused by noise or bias.

Unreliability, then, is the random component of construct-irrelevant variance. Reliability scores in RM Compare sessions are generally high because the approach reduces this random component. In doing so, it helps us to be more certain about the legitimacy of our conclusions and the overall validity of any given session.
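A minimal sketch of this decomposition follows. It assumes, as a simplification, that score variance splits into three independent parts that simply add up: construct variance, systematic construct-irrelevant variance, and random variance. The function name, the keys, and the numbers are all hypothetical, and the calculation illustrates the concept only. It is not the method behind the reliability figure in RM Compare reports.

```python
def variance_shares(construct, systematic_irrelevant, random_noise):
    """Split total score variance into illustrative shares.

    Assumes the three components are independent and additive,
    which is a simplification made for this sketch.
    """
    total = construct + systematic_irrelevant + random_noise
    if total <= 0:
        raise ValueError("total variance must be positive")
    return {
        # Everything that is not random counts as consistent variance.
        "reliability": (construct + systematic_irrelevant) / total,
        # Only the construct itself counts towards what we want to measure.
        "construct_share": construct / total,
    }


# Hypothetical session: low random noise, but a sizeable systematic problem.
noisy = variance_shares(construct=60, systematic_irrelevant=30, random_noise=40)
quiet = variance_shares(construct=60, systematic_irrelevant=30, random_noise=10)

print(noisy)
print(quiet)
```

Under these assumptions, cutting the random component raises reliability, but the systematic component is untouched. A session can therefore look highly reliable while part of what it measures consistently is still the wrong thing, which is why reliability on its own does not establish validity.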

This is why we should talk about validity including reliability.

Threats to validity

There are two key threats to validity:

  1. Construct under-representation means that the assessment does not cover the whole of a concept (for example, considering reading but ignoring writing when assessing English).
  2. Construct irrelevance happens when a test ends up assessing, at least in part, something other than the thing it claims to assess. As discussed above, this irrelevant variation can be of two kinds:
    1. Systematic
    2. Random

Construct under-representation and systematic construct irrelevance are key considerations for session creators. Failure to address them adequately will reduce the legitimacy of any conclusions that can be drawn. In other words, validity will be reduced.

Considering washback

We have written before about the challenge of washback. It poses an obvious and real threat to validity, as it can encourage both construct under-representation and systematic construct irrelevance.

The importance of constructs

A key point when thinking about validity and reliability is that the focus should not be on arguments about assessment itself. Rather, it should be on what we are assessing: in other words, the constructs.

Clarity and certainty around any given construct make it more likely that we will create valid and reliable assessment sessions. They allow us to take an Evidence-Centred Design approach, where the key question throughout is whether a proposed assessment is likely to give the evidence needed to draw the conclusions we are seeking.