Opinion

On validity and reliability

By Mark House

20th June 2024

A common question we are asked concerns the reliability data shown in the standard reports. To understand the usefulness of this number, it is worth stepping back a little to consider reliability as a concept and its relationship to validity in assessment session design.

The search for legitimate conclusions

An important starting point is to understand that validity is not a property of a test, or of test scores; rather, it is a property of the conclusions we draw from test outcomes. A conclusion might, for example, be that "students at the top of a rank are better than students at the bottom of a rank on competency X". Validity expresses the legitimacy of this conclusion.

The level of reliability is a key metric in understanding this legitimacy. For example, if the result for a student for a certain area of competence is different tomorrow from what it was today, then any conclusions we draw cannot be warranted. The reliability data in an RM Compare session helps us to understand this.
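The "different tomorrow from what it was today" idea can be illustrated with a small simulation. This is only a sketch under stated assumptions: the score scale, the noise levels and the function names are hypothetical, and a test-retest correlation is used here as one common way of expressing reliability, not as the statistic shown in RM Compare reports.

```python
import random


def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def simulate_retest(noise_sd, n_students=2000, seed=1):
    """Simulate two sittings of the same assessment and correlate the results.

    Assumption: each student has a fixed underlying score (mean 50, SD 10),
    and each sitting adds independent random noise with the given SD.
    """
    rng = random.Random(seed)
    underlying = [rng.gauss(50, 10) for _ in range(n_students)]
    today = [u + rng.gauss(0, noise_sd) for u in underlying]
    tomorrow = [u + rng.gauss(0, noise_sd) for u in underlying]
    return pearson(today, tomorrow)


low_noise = simulate_retest(noise_sd=2)
high_noise = simulate_retest(noise_sd=15)
print(f"today vs tomorrow, low noise:  {low_noise:.2f}")
print(f"today vs tomorrow, high noise: {high_noise:.2f}")
```

Under these assumptions the expected correlation is the share of score variance that comes from the underlying differences between students, so the low-noise run should land close to 0.96 and the high-noise run close to 0.31. In the second case a student's position today says little about their position tomorrow, and a conclusion based on the rank would be hard to defend.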

In this regard, then, reliability is only useful when it is treated as part of validity.

Reliability as a part of validity

We want our student scores to reflect differences in the things we are interested in (the constructs). If scores vary for other reasons, that variation is irrelevant to the construct of interest.

This variation could be systematic (for example, poor readers struggling to understand questions on a maths test) or it could be due to random influences. As we have written in previous posts, random variation is frequently caused by noise or bias.

Unreliability, then, is the random component of construct-irrelevant variance. Reliability scores in RM Compare sessions are generally high because the approach reduces this random component. In doing so, it helps us to be more certain about the legitimacy of our conclusions and the overall validity of any given session.
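One way to make this concrete is a simple additive model in the spirit of classical test theory. This is an illustrative assumption rather than a description of how the reports calculate reliability: an observed score X is treated as the sum of a construct-relevant part C, a systematic construct-irrelevant part S and a random part E, with the three parts assumed to be uncorrelated.

```latex
\begin{align*}
X &= C + S + E \\
\sigma_X^2 &= \sigma_C^2 + \sigma_S^2 + \sigma_E^2 \\
\text{reliability} &= 1 - \frac{\sigma_E^2}{\sigma_X^2}
\end{align*}
```

On this reading, shrinking the random term raises reliability, but the systematic term S is untouched. A session can therefore be highly reliable and still support weak conclusions if S is large.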

This is why we should talk about validity including reliability.

Threats to validity

There are two key threats to validity:

Construct under-representation means that the assessment does not consider the entirety of a concept (for example, considering reading but ignoring writing when assessing English).
Construct irrelevance happens when a test ends up assessing, at least in part, something other than the thing it claims to assess. As we have already discussed, this can take two forms:
1. Systematic
2. Random

Construct under-representation and systematic construct irrelevance are key considerations for session creators. Failure to address them adequately will reduce the legitimacy of any conclusions that can be drawn. In other words, validity will be reduced.

Considering washback

We have written before about the challenge of washback. It poses a very obvious and real threat to validity, as it can encourage both construct under-representation and systematic construct irrelevance.

The importance of constructs

A key point when thinking about validity and reliability is that the focus should not be on arguments about assessment; rather, it should be on what we are assessing. In other words, the constructs.

Clarity and certainty around any given construct make it more likely that we will create valid and reliable assessment sessions. They allow us to take an Evidence-Centred Design approach, where the key question throughout is whether any proposed assessment is likely to give the evidence needed to draw the conclusions we are seeking.
