
From the perspective of the creative thinker or innovator, consistency can be viewed as problematic. Consistent thinking leads to more of the same, as it limits diversity and change. On the other hand, inconsistent thinking or thinking outside the box produces new methods and ideas, inventions and breakthroughs, leading to innovation and growth.

Standardized tests are designed to be consistent, and, by their very nature, they tend to be poor measures of creative thinking. In fact, the construct of creativity is one of the more elusive in educational and psychological testing. Although published tests of creative thinking and problem solving exist, administration procedures are complex and consistency in the resulting scores can be, unsurprisingly, very low. Creativity seems to involve an inconsistency in thinking and behavior that is challenging to measure reliably.

From the perspective of the test maker or test taker, which happens to be the perspective we take in this book, consistency is critical to valid measurement. An unreliable test produces inconsistent results, which in turn provide only weak support for the inferences the test is intended to make. Nearly a century of research provides us with a framework for examining and understanding the reliability of test scores, and, most importantly, how reliability can be estimated and improved.

This chapter introduces reliability within the framework of the classical test theory (CTT) model, which is then extended to generalizability (G) theory. In Chapter 7, we’ll learn about reliability within the item response theory model. These theories all involve measurement models, sometimes referred to as latent variable models, which are used to describe the construct or constructs assumed to underlie responses to test items.

This chapter starts with a general definition of reliability in terms of consistency of measurement. The CTT model and assumptions are then presented in connection with statistical inference and measurement models, which were introduced in Chapter 1. Reliability and unreliability, that is, the standard error of measurement, are discussed as products of CTT, and the four main study designs and corresponding methods for estimating reliability are reviewed. Finally, reliability is discussed for situations where scores come from raters. This is called interrater reliability, and it is best conceptualized using G theory.

In this chapter, we’ll conduct reliability analyses on PISA09 data using epmr, and plot results using ggplot2. We’ll also simulate some data and examine interrater reliability using epmr.
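For readers following along in R, a minimal setup might look like the sketch below. It assumes epmr and ggplot2 are already installed (epmr is distributed via GitHub, ggplot2 via CRAN).

```r
# Load the packages used in this chapter
# epmr includes the PISA09 data set along with functions for
# estimating reliability
library("epmr")
library("ggplot2")

# PISA09 loads with epmr; preview the first few rows
head(PISA09)
```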

Figure 5.2 contains a plot similar to the one in Figure 5.1, where we identified $X$, $T$, and $E$. This time we have scores on two reading test forms, with the first form now called $X_1$ and the second $X_2$, and we’re going to focus on the overall distances of the points from the line that runs diagonally across the plot. Once again, this line represents truth. Based on the assumptions of the CTT model, a person with a true score of 11 on $X_1$ will also score 11 on $X_2$.

Although the solid line represents what we’d expect to see for true scores, we don’t actually know anyone’s true score, even for those students who happen to get the same score on both test forms. The points in Figure 5.2 are all observed scores. Students who score the same on both forms do indicate more consistent measurement. However, their true scores could still differ from their observed scores; there’s no way to know. To calculate truth, we would have to administer the test an infinite number of times and then take the average, or simply simulate it, as in Figure 5.1.
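To restate the model underlying this discussion: in CTT, each observed score is the sum of a true score and an error score, and the true score is defined as the expected value of the observed score over infinite repeated administrations. Using the notation above,

$$
X = T + E, \qquad T = \mathcal{E}(X), \qquad \mathcal{E}(E) = 0,
$$

so averaging over many administrations cancels the errors and recovers $T$, which is exactly what the simulation in Figure 5.1 approximates.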

```r
# Simulate scores for a new form of the reading test, called y
# rho is the made-up reliability, which is set to 0.80
# x is the original reading total scores
# Form y is slightly easier than x, with mean 6 and SD 3
# (the truncated call is completed here on the assumption that x
# holds the reading totals and epmr::rsim() takes these arguments)
xysim <- rsim(rho = .8, x = x, meany = 6, sdy = 3)
```
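A plot like Figure 5.2 can then be rebuilt from the simulated scores. The sketch below assumes rsim() returns the original and simulated scores as elements x and y (hypothetical element names, inferred from the comments above).

```r
# Scatterplot of the two forms, with the identity line representing
# truth; jitter separates overlapping integer score points
simdat <- data.frame(x1 = xysim$x, x2 = xysim$y)
ggplot(simdat, aes(x1, x2)) +
  geom_point(position = position_jitter(width = 0.3, height = 0.3)) +
  geom_abline(intercept = 0, slope = 1)
```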
