

Charles Darr

Validity and reliability are two key ideas in assessment. In the last issue of set I looked at the concept of validity and how it might inform the assessment decisions we make as classroom practitioners and school leaders. In this article I address the issue of reliability, and how it too can help inform our assessment strategies and practices.
Reliability refers to the consistency of the results we obtain from an assessment. This may mean:

• consistency over time, when the same test (or an equivalent form) is administered on more than one occasion;
• consistency across the items or tasks that make up an assessment; or
• consistency between different markers or assessors scoring the same work.
Determining reliability has traditionally been seen as a statistical exercise. It usually involves calculating a reliability coefficient to indicate how well assessment results agree over repeated uses of the assessment tool. Reliability coefficients vary between zero and one, with zero indicating no agreement and one, total agreement (a result that is never actually obtained in educational assessment). Test developers use several methodologies to calculate reliability coefficients, depending on the type of consistency they are interested in. Some of these are briefly described below. Reliability coefficients on standardised tests are often greater than 0.9, indicating a high degree of reliability. The question of how high the reliability for a set of assessment results should be, however, depends very much on what level of decision making will be based on our assessment results. When we are dealing with assessments that are highly significant we need high reliability. When the decisions do not have lasting consequences, a high reliability measure is not as important.
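As a concrete illustration, here is a minimal sketch (in Python) of the simplest case described below: a test/retest coefficient calculated as the correlation between two sittings of the same test. The score lists are invented for the purpose of the example.

```python
from statistics import correlation  # available from Python 3.10

# Hypothetical scores for the same eight students on two sittings of one test
first_sitting  = [12, 18, 25, 31, 22, 15, 28, 20]
second_sitting = [14, 17, 27, 30, 21, 16, 26, 22]

# A test/retest reliability coefficient: the (Pearson) correlation between
# the two sets of results. Values near 1 indicate highly consistent results.
r = correlation(first_sitting, second_sitting)
print(f"Test/retest reliability coefficient: {r:.2f}")
```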
The reliability coefficient for a set of test results is sometimes used to calculate what is called the standard error of measurement (SEm). This can be used to describe a band of achievement within which we can be reasonably sure a student’s true level of achievement actually lies. A true level of achievement can be thought of as the level that would be achieved by the student if the test was perfectly reliable. For example, when a score is reported as 30 with a SEm of 3, we can be reasonably confident that the true achievement level of the individual would fall somewhere in the range 27 to 33. In educational testing it is considered good practice to report the SEm. In the Progressive Achievement Test of Reading, for instance, the manual writers provide an estimate of the SEm for each individual test (which is usually around 3 marks). Being aware of the SEm helps us regard a student’s test score as describing a range rather than a precise point.
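To show the arithmetic, the sketch below derives a SEm from a reliability coefficient using the standard relationship SEm = SD × √(1 − reliability). The standard deviation and reliability figures are assumed values, chosen so that the SEm comes out at about 3 marks to match the example above; they are not taken from any actual test manual.

```python
import math

sd          = 9.5    # standard deviation of the test scores (assumed)
reliability = 0.90   # reliability coefficient (assumed)

# Standard error of measurement: SEm = SD * sqrt(1 - reliability)
sem = sd * math.sqrt(1 - reliability)
print(f"SEm = {sem:.1f}")   # roughly 3 marks with these figures

# A reported score is better read as a band of score +/- SEm
score = 30
print(f"true achievement likely in the range {score - sem:.0f} to {score + sem:.0f}")
# with a SEm of 3 this gives the 27-33 band described above
```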
It is unlikely that teachers will spend time calculating reliability coefficients or reporting standard errors of measurement for the assessment tasks or tests they develop themselves.
Table 1: Methodologies for determining reliability

Test/retest: The same test, or an equivalent version, is administered at two different times to the same group of students. The two sets of results are then compared to calculate the reliability coefficient. This method provides an indication of how consistent the results are over time or between equivalent forms of the same test.

Internal consistency: The results on different tasks or sections of an assessment are compared to see how well they relate. Several different methods can be used, including dividing the test into two halves and comparing the results on each half (the split-half method). Other instances of this type of reliability coefficient involve more sophisticated statistical methods; in test manuals it is common to see what is called Cronbach's alpha, or the Kuder-Richardson formula 20 or 21. Calculating this type of reliability coefficient provides an indication of how consistently the items or tasks within an assessment promote the same result.

Inter-rater reliability: Results from different markers can be compared to ascertain the level of agreement. This method, used to show how consistently two or more assessors are scoring the same tasks, is called moderation when it is used in the context of assessment for qualifications.
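For the internal-consistency row of Table 1, here is a minimal sketch of Cronbach's alpha. The item-level scores are invented; the coefficient compares the variance of students' total scores with the variances of the individual items.

```python
from statistics import pvariance

# Invented data: each row is one student's right/wrong scores on four items
scores = [
    [1, 0, 1, 1],   # student A
    [1, 1, 1, 1],   # student B
    [0, 0, 1, 0],   # student C
    [1, 0, 0, 1],   # student D
    [0, 0, 0, 0],   # student E
]

k = len(scores[0])                                   # number of items
item_vars = [pvariance([row[i] for row in scores]) for i in range(k)]
total_var = pvariance([sum(row) for row in scores])  # variance of total scores

# Cronbach's alpha: k/(k-1) * (1 - sum of item variances / total-score variance)
alpha = (k / (k - 1)) * (1 - sum(item_vars) / total_var)
print(f"Cronbach's alpha: {alpha:.2f}")
```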
This does not mean, however, that the issue of reliability is irrelevant to classroom assessment: it is very important to base judgements and decisions about students on assessment results that are dependable. So how can classroom teachers ensure that their own assessments are reliable?

Jeffrey Smith (2003) argues for an alternative definition of reliability that he believes is more appropriate for classroom-based assessment. He proposes the idea of "sufficiency of information". For Smith, judgements about the reliability of classroom assessments can be built on a question such as: "Does this assessment provide me with enough information to make a judgement of each student's level of accomplishment with regard to this learning?" (Smith, 2003, p. 26). Taylor and Nolen (1996) also promote this point of view. According to them, "A wide range of assessments can serve the purpose of a long test—the more sources of information, with demonstrable evidence for validity, the more likely dependable decisions can be made" (p. 11).

This kind of definition raises the question: how much information is enough? There is a fine line between "enough information" and too much assessment. There is of course no easy answer to this; in the end, it comes down to a professional judgement. Some practical advice on this issue, however, is provided by Anne Davies (2000). She suggests that teachers use the concept of triangulation as a way of increasing the reliability and validity of classroom assessments. Triangulation involves using three different sources of assessment evidence as the basis for any decision making. For Davies, these sources are observations of learning, products students create (including test results), and learning conversations. When teachers collect and consider evidence from each of these sources, they are far more likely to reach dependable and meaningful conclusions about students' progress than when they rely on one single area or result alone.

The issue of reliability alerts us to the fact that random variation does occur in assessment. It is something that should concern us when we assess students, particularly when the results of assessments are used to make decisions about individual students and/or the teaching and learning programmes they are involved in. As in my previous article on validity, this is only a "hitchhiker's guide" to what is a very important assessment concept; there is a lot more we could say about reliability. This may, however, serve as a reminder that it is important to take issues such as reliability into account if we are to make informed decisions about both the process of assessment and the results.

Charles Darr is a senior researcher at the New Zealand Council for Educational Research. Email: charles.darr@nzcer.org.nz
References

Chase, C. I. (1974). Measurement for educational evaluation. Reading, MA: Addison-Wesley.

Davies, A. (2000). Making classroom assessment work. Courtenay, BC: Connections Publishing.

Smith, J. K. (2003). Reconsidering reliability in classroom assessment and grading. Educational Measurement: Issues and Practice, 22(4), 26–33.

Taylor, C. S., & Nolen, S. B. (1996). What does the psychometrician's classroom look like?: Reframing assessment concepts in the context of learning. Education Policy Analysis Archives, 4(17). Retrieved from http://epaa.asu.edu/epaa/v4n17.html
Note: Unreliable results will not lead to valid inferences about student achievement. However, just because assessment results are reliable does not mean that we are assessing what counts. Reliability is therefore a necessary but not sufficient condition for validity.