From the perspective of the creative thinker or innovator, consistency can be viewed as problematic. Consistent thinking leads to more of the same, as it limits diversity and change. On the other hand, inconsistent thinking or thinking outside the box produces new methods and ideas, inventions and breakthroughs, leading to innovation and growth.

Standardized tests are designed to be consistent, and, by their very nature, they tend to be poor measures of creative thinking. In fact, the construct of creativity is one of the more elusive in educational and psychological testing. Although published tests of creative thinking and problem solving exist, administration procedures are complex and consistency in the resulting scores can be, unsurprisingly, very low. Creativity seems to involve an inconsistency in thinking and behavior that is challenging to measure reliably.

From the perspective of the test maker or test taker, which happens to be the perspective we take in this book, consistency is critical to valid measurement. An inconsistent or unreliable test produces unreliable results that only inconsistently support the intended inferences of the test. Nearly a century of research provides us with a framework for examining and understanding the reliability of test scores, and, most importantly, how reliability can be estimated and improved.

This chapter introduces reliability within the framework of the classical test theory (CTT) model, which is then extended to generalizability (G) theory. In Chapter 7, we’ll learn about reliability within the item response theory model. These theories all involve measurement models, sometimes referred to as latent variable models, which are used to describe the construct or constructs assumed to underlie responses to test items.

This chapter starts with a general definition of reliability in terms of consistency of measurement. The CTT model and assumptions are then presented in connection with statistical inference and measurement models, which were introduced in Chapter 1. Reliability and unreliability, that is, the standard error of measurement, are discussed as products of CTT, and the four main study designs and corresponding methods for estimating reliability are reviewed. Finally, reliability is discussed for situations where scores come from raters. This is called interrater reliability, and it is best conceptualized using G theory.

In this chapter, we’ll conduct reliability analyses on PISA09 data using epmr, and plot results using ggplot2. We’ll also simulate some data and examine interrater reliability using epmr.
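
The code below assumes both packages are loaded and that the PISA09 data are available through epmr; a minimal setup would look like the following.

# Packages used throughout the chapter
# PISA09 is assumed here to be available via epmr
library("epmr")
library("ggplot2")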

Figure 5.2 contains a plot similar to the one in Figure 5.1 where we identified \(X\), \(T\), and \(E\). This time, we have scores on two reading test forms, with the first form now labeled \(X_1\) and the second \(X_2\), and we're going to focus on the overall distances of the points from the line that goes diagonally across the plot. Once again, this line represents truth. A person with a true score of 11 on \(X_1\) will score 11 on \(X_2\), based on the assumptions of the CTT model.

Although the solid line represents what we'd expect to see for true scores, we don't actually know anyone's true score, even for those students who happen to get the same score on both test forms. The points in Figure 5.2 are all observed scores. Students who score the same on both forms do indicate more consistent measurement. However, their true scores could still differ from their observed scores. There's no way to know. To calculate truth, we would have to administer the test an infinite number of times and then take the average, or simply simulate it, as in Figure 5.1.

# Simulate scores for a new form of the reading test 
# called y
# rho is the made up reliability, which is set to 0.80
# x is the original reading total scores
# Form y is slightly easier than x with mean 6 and SD 3
xysim <- rsim(rho = .8, x = scores$x1, meany = 6, sdy = 3)
scores$x2 <- round(setrange(xysim$y, scores$x1))
ggplot(scores, aes(x1, x2)) +
  geom_point(position = position_jitter(w = .3, h = .3)) +
  geom_abline(col = "blue")


Figure 5.2: PISA total reading scores and scores on a simulated second form of the reading test.

The assumptions of CTT make it possible for us to estimate the reliability of scores using a sample of individuals. Figure 5.2 shows scores on two test forms, and the overall scatter of the scores from the solid line gives us an idea of the linear relationship between them. There appears to be a strong, positive, linear relationship. Thus, people tend to score similarly from one form to the next, with higher scores on one form corresponding to higher scores on the other. The correlation coefficient for this data set, cor(scores$x1, scores$x2) = 0.802, gives us an estimate of how similar scores are, on average, from \(X_1\) to \(X_2\). Because the correlation is positive and strong for this plot, we would expect a person's score to be pretty similar from one testing to the next.

Imagine if the scatter plot were instead nearly circular, with no clear linear trend from one test form to the next. The correlation in this case would be near zero. Would we expect someone to receive a similar score from one test to the next? On the other hand, imagine a scatter plot that falls perfectly on the line. If you score, for example, 10 on one form, you also score 10 on the other. The correlation in this case would be 1. Would we expect scores to remain consistent from one test to the next?

We’re now ready for a statistical definition of reliability. In CTT, reliability is defined as the proportion of variability in \(X\) that is due to variability in true scores \(T\):

\[\begin{equation} r = \frac{\sigma^2_T}{\sigma^2_X}. \tag{5.3} \end{equation}\]

Note that true scores are assumed to be constant in CTT for a given individual, but not across individuals. Thus, reliability is defined in terms of variability in scores for a population of test takers. Why do some individuals get higher scores than others? In part because they actually have higher abilities or true scores than others, but also, in part, because of measurement error. The reliability coefficient in Equation (5.3) tells us how much of our observed variability in \(X\) is due to true score differences.

Unfortunately, we can’t ever know the CTT true scores for test takers. So we have to estimate reliability indirectly. One indirect estimate made possible by CTT is the correlation between scores on two forms of the same test, as represented in Figure 5.2:

\[\begin{equation} r = \rho_{X_1 X_2} = \frac{\sigma_{X_1 X_2}}{\sigma_{X_1} \sigma_{X_2}}. \tag{5.4} \end{equation}\]

This correlation is estimated as the covariance, or the shared variance between the distributions on two forms, divided by a product of the standard deviations, or the total available variance within each distribution.
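
To make Equations (5.3) and (5.4) concrete, here is a small simulation sketch, separate from the PISA09 analysis, in which true scores and errors are generated directly so that the true score variance is known. With these made-up values, both quantities should come out near 0.80.

# Simulate the CTT model directly: observed = true + error, two parallel forms
# All values below are made up for illustration
set.seed(101)
n <- 10000
truth <- rnorm(n, mean = 6, sd = 3)
x1 <- truth + rnorm(n, mean = 0, sd = 1.5)
x2 <- truth + rnorm(n, mean = 0, sd = 1.5)
# Reliability as true over observed variance, Equation (5.3)
var(truth) / var(x1)
# Correlation between parallel forms, Equation (5.4)
cor(x1, x2)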

There are other methods for estimating reliability from a single form of a test. Only two are presented here: split-half reliability and coefficient alpha. Split-half is included mainly because of its connection to what's called the Spearman-Brown reliability formula. The split-half method predates coefficient alpha and is computationally simpler. It takes scores on a single test form and separates them into scores on two halves of the test, which are treated as separate test forms. The correlation between these two halves then represents an indirect estimate of the reliability defined in Equation (5.3).

# Split half correlation, assuming we only had scores on 
# one test form
# With an odd number of reading items, one half has 5 
# items and the other has 6
xsplit1 <- rowSums(PISA09[PISA09$cnt == "BEL", 
  rsitems[1:5]])
xsplit2 <- rowSums(PISA09[PISA09$cnt == "BEL", 
  rsitems[6:11]])
cor(xsplit1, xsplit2, use = "complete")
## [1] 0.624843

The Spearman-Brown formula was originally used to correct for the reduction in reliability that occurred when correlating two test forms that were only half the length of the original test. In theory, reliability will increase as we add items to a test. Thus, Spearman-Brown is used to estimate, or predict, what the reliability would be if the half-length tests were made into full-length tests.

# sb_r() in the epmr package uses the Spearman-Brown 
# formula to estimate how reliability would change when 
# test length changes by a factor k
# If test length were doubled, k would be 2
sb_r(r = cor(xsplit1, xsplit2, use = "complete"), k = 2)
## [1] 0.7691119

The Spearman-Brown formula also has other practical uses. Today, it is most commonly used during the test development process to predict how reliability would change if a test form were reduced or increased in length. For example, if you are developing a test and you gather pilot data on 20 test items with a reliability estimate of 0.60, Spearman-Brown can be used to predict how this reliability would go up if you increased the test length to 30 or 40 items. You could also pilot test a large number of items, say 100, and predict how reliability would decrease if you wanted to use a shorter test.

The Spearman-Brown reliability, \(r_{new}\), is estimated as a function of what’s labeled here as the old reliability, \(r_{old}\), and the factor by which the length of \(X\) is predicted to change, \(k\):

\[\begin{equation} r_{new} = \frac{kr_{old}}{(k - 1)r_{old} + 1}. \tag{5.5} \end{equation}\]

Again, \(k\) is the factor by which the test length is increased or decreased. It is equal to the number of items in the new test divided by the number of items in the original test. Multiply \(k\) by the old reliability, and then divide the result by \((k - 1)\) times the old reliability, plus 1. For the example mentioned above, going from 20 to 30 items gives \(k = 30/20 = 1.5\), so the new reliability is \((1.5 \times 0.60)/((1.5 - 1) \times 0.60 + 1) = 0.69\). Going to 40 items, \(k = 2\) and the new reliability is 0.75. The epmr package contains sb_r(), a simple function for estimating the Spearman-Brown reliability.
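
As a quick check, sb_r() reproduces both predictions, with the inputs 0.60, 30/20, and 2 taken directly from the example above.

# Going from 20 to 30 items, k = 30/20 = 1.5
sb_r(r = 0.60, k = 30/20)
## [1] 0.6923077
# Going from 20 to 40 items, k = 2
sb_r(r = 0.60, k = 2)
## [1] 0.75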

Alpha is arguably the most popular form of reliability. Many people refer to it as "Cronbach's alpha," but Cronbach himself never intended to claim authorship for it, and in later years he regretted that it was attributed to him (see Cronbach and Shavelson 2004). The popularity of alpha is due to the fact that it can be calculated using scores from a single test form, rather than two separate administrations or split halves. Alpha is defined as

\[\begin{equation} r = \alpha = \left(\frac{J}{J - 1}\right)\left(\frac{\sigma^2_X - \sum\sigma^2_{X_j}}{\sigma^2_X}\right), \tag{5.6} \end{equation}\]

where \(J\) is the number of items on the test, \(\sigma^2_X\) is the variance of observed total scores on \(X\), and \(\sum\sigma^2_{X_j}\) is the sum of variances for each item \(j\) on \(X\). To see how it relates to the CTT definition of reliability in Equation (5.3), consider the top of the second fraction in Equation (5.6). The total test variance \(\sigma^2_X\) captures all the variability available in the total scores for the test. We’re subtracting from it the variances that are unique to the individual items themselves. What’s left over? Only the shared variability among the items in the test. We then divide this shared variability by the total available variability. Within the formula for alpha you should see the general formula for reliability, true variance over observed.
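
To connect the formula to code, here is a minimal sketch that computes alpha directly from Equation (5.6) for the PISA09 reading items using base R. It keeps complete cases only, so the value may differ slightly from the rstudy() estimate below.

# Coefficient alpha computed directly from Equation (5.6)
# Complete cases only on the reading items
items <- na.omit(PISA09[, rsitems])
J <- ncol(items)
total_var <- var(rowSums(items))
item_vars <- sum(apply(items, 2, var))
(J / (J - 1)) * ((total_var - item_vars) / total_var)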

# epmr includes rstudy() which estimates alpha and a 
# related form of reliability called omega, along with 
# corresponding SEM
# You can also use coef_alpha() to obtain coefficient 
# alpha directly
rstudy(PISA09[, rsitems])
## 
## Reliability Study
## 
## Number of items: 11 
## 
## Number of cases: 44878 
## 
## Estimates:
##        coef  sem
## alpha 0.760 1.40
## omega 0.763 1.39

Keep in mind, alpha is an estimate of reliability, just like the correlation is. So, any equation requiring an estimate of reliability, like SEM below, can be computed using either a correlation coefficient or an alpha coefficient. Students often struggle with this point: correlation is one estimate of reliability, alpha is another. They’re both estimating the same thing, but in different ways based on different reliability study designs.

Now that we’ve defined reliability in terms of the proportion of observed variance that is true, we can define unreliability as the portion of observed variance that is error. This is simply 1 minus the reliability:

\[\begin{equation} 1 - r = \frac{\sigma^2_E}{\sigma^2_X}. \tag{5.7} \end{equation}\]

Typically, we're more interested in how the unreliability of a test can be expressed in terms of the available observed variability. Thus, we multiply the standard deviation of \(X\) by the square root of the unreliable proportion of variance to obtain the SEM:

\[\begin{equation} SEM = \sigma_X\sqrt{1 - r}. \tag{5.8} \end{equation}\]

The SEM is the average variability in observed scores attributable to error. Like any statistical standard error, it can be used to create a confidence interval (CI) around the quantity it estimates, that is, \(T\). Since we don't have \(T\), we instead create the confidence interval around \(X\) to index how confident we are that \(T\) falls within it for a given individual. For example, the verbal reasoning subtest of the GRE is reported to have a reliability of 0.93 and an SEM of 2.2, on a scale that ranges from 130 to 170. Thus, an observed verbal reasoning score of 155 has a 95% confidence interval of about \(\pm 4.4\) points. At \(X = 155\), we are 95% confident that the true score falls somewhere between 150.6 and 159.4. (Note that scores on the GRE are actually estimated using IRT.)
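
As a quick check of Equation (5.8) against the GRE example, the reported SEM of 2.2 follows from the reliability of 0.93 and an observed SD of roughly 8.3. The SD is not reported, so 8.3 is simply the hypothetical value implied by the other two numbers.

# SEM from Equation (5.8), with sd = 8.3 as a hypothetical value implied
# by the reported reliability (0.93) and SEM (2.2)
8.3 * sqrt(1 - 0.93)
## [1] 2.195974
# 95% confidence interval of about +/- 2 SEM around an observed score of 155
155 + c(-2, 2) * 2.2
## [1] 150.6 159.4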

Confidence intervals for PISA09 can be estimated in the same way. First, we choose a measure of reliability, find the SD of observed scores, and obtain the corresponding SEM. Then, we can find the CI, which gives us the expected amount of uncertainty in our observed scores due to random measurement error. Here, we’re calculating SEM and the CI using alpha, but other reliability estimates would work as well. Figure 5.3 shows the 11 possible PISA09 reading scores in order, with error bars based on SEM for students in Belgium.

# Get alpha and SEM for students in Belgium
bela <- coef_alpha(PISA09[PISA09$cnt == "BEL", rsitems])$alpha
# The sem function from epmr sometimes overlaps with sem from 
# another R package so we're spelling it out here in long 
# form
belsem <- epmr::sem(r = bela, sd = sd(scores$x1,
  na.rm = T))
# Plot the 11 possible total scores against themselves
# Error bars are shown for 1 SEM, giving a 68% confidence
# interval and 2 SEM, giving the 95% confidence interval
# x is converted to factor to show discrete values on the 
# x-axis
beldat <- data.frame(x = 1:11, sem = belsem)
ggplot(beldat, aes(factor(x), x)) +
  geom_errorbar(aes(ymin = x - sem * 2,
    ymax = x + sem * 2), col = "violet") +
  geom_errorbar(aes(ymin = x - sem, ymax = x + sem),
    col = "yellow") +
  geom_point()


Figure 5.3: The PISA09 reading scale shown with 68 and 95 percent confidence intervals around each point.

Figure 5.3 helps us visualize the impact of unreliable measurement on score comparisons. For example, note that the top of the 95% confidence interval for \(X\) of 2 extends nearly to 5 points, and thus overlaps with the CI for adjacent scores 3 through 7. It isn't until \(X\) of 8 that the CI no longer overlap. With an SEM of 1.425, and thus a 95% CI of about \(\pm 2.85\), we're 95% confident that students with observed scores differing by at least 5.7 points have different true scores. Students with observed scores closer than 5.7 points may actually have the same true scores.
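
Using belsem from the code above, the same comparison can be checked directly; two 95% intervals stop overlapping once observed scores differ by more than four SEM.

# Half width of the 95% confidence interval, 2 SEM
belsem * 2
# Smallest observed score difference with non-overlapping 95% intervals, 4 SEM
belsem * 4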

There are no agreed-upon standards for interpreting reliability coefficients. Reliability is bound by 0 on the lower end and 1 at the upper end, because, by definition, the amount of true variability can never be less or more than the total available variability in \(X\). Higher reliability is clearly better, but cutoffs for acceptable levels of reliability vary for different fields, situations, and types of tests. The stakes of a test are an important consideration when interpreting reliability coefficients. The higher the stakes, the higher we expect reliability to be. Otherwise, cutoffs depend on the particular application.

In general, reliabilities for educational and psychological tests can be interpreted using scales like the ones presented in Table 5.1. With medium-stakes tests, a reliability of 0.70 is sometimes considered minimally acceptable, 0.80 is decent, 0.90 is quite good, and anything above 0.90 is excellent. High stakes tests should have reliabilities at or above 0.90. Low stakes tests, which are often simpler and shorter than higher-stakes ones, often have reliabilities as low as 0.70. These are general guidelines, and interpretations can vary considerably by test. Remember that the cognitive measures in PISA would be considered low-stakes at the student level.

A few additional considerations are necessary when interpreting coefficient alpha. First, alpha assumes that all items measure the same single construct. Items are also assumed to be equally related to this construct, that is, they are assumed to be parallel measures of the construct. When the items are not parallel measures of the construct, alpha is considered a lower-bound estimate of reliability, that is, the true reliability for the test is expected to be higher than indicated by alpha. Finally, alpha is not a measure of dimensionality. It is frequently claimed that a strong coefficient alpha supports the unidimensionality of a measure. However, alpha does not index dimensionality. It is impacted by the extent to which all of the test items measure a single construct, but it does not necessarily go up or down as a test becomes more or less unidimensional.
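
To illustrate that last point, here is a small, entirely made-up simulation; coef_alpha() is assumed to accept any numeric data frame of item scores, as it does for the PISA09 items above. Alpha comes out fairly high even though the items deliberately measure two distinct, moderately correlated constructs.

# Simulate 12 items measuring two correlated factors, 6 items per factor
set.seed(22)
n <- 5000
f1 <- rnorm(n)
f2 <- 0.5 * f1 + sqrt(1 - 0.25) * rnorm(n)
sim_items <- data.frame(replicate(6, f1 + rnorm(n)),
  replicate(6, f2 + rnorm(n)))
# High alpha despite two underlying dimensions
coef_alpha(sim_items)$alpha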

Table 5.1: General Guidelines for Interpreting Reliability Coefficients

Reliability              Medium Stakes    Low Stakes
\(r \geq 0.90\)            Excellent        Excellent
\(0.80 \leq r < 0.90\)     Good             Excellent
\(0.70 \leq r < 0.80\)     Acceptable       Good
\(0.60 \leq r < 0.70\)     Borderline       Acceptable
\(0.50 \leq r < 0.60\)     Low              Borderline
\(0.20 \leq r < 0.50\)     Unacceptable     Low
\(0.00 \leq r < 0.20\)     Unacceptable     Unacceptable

Now that we’ve established the more common estimates of reliability and unreliability, we can discuss the four main study designs that allow us to collect data for our estimates. These designs are referred to as internal consistency, equivalence, stability, and equivalence/stability designs. Each design produces a corresponding type of reliability that is expected to be impacted by different sources of measurement error.

The four standard study designs vary in the number of test forms and the number of testing occasions involved in the study. Until now, we've been talking about using two test forms on two separate administrations. This study design is found in the lower right corner of Table 5.2, and it provides us with an estimate of equivalence (for two different forms of a test) and stability (across two different administrations of the test). Because it combines both factors, this design has the potential to capture the most sources of measurement error, and it can thus produce the lowest estimate of reliability. The more time that passes between administrations, and the more the two forms differ in content and other features, the more error we would expect to be introduced. On the other hand, as our two test forms are administered closer in time, we move from the lower right corner to the upper right corner of Table 5.2, and our estimate of reliability captures less of the measurement error introduced by the passage of time. We're left with an estimate of the equivalence between the two forms.

As our test forms become more and more equivalent, we eventually end up with the same test form, and we move to the first column in Table 5.2, where one of two types of reliability is estimated. First, if we administer the same test twice with time passing between administrations, we have an estimate of the stability of our measurement over time. Given that the same test is given twice, any measurement error will be due to the passage of time, rather than differences between the test forms. Second, if we administer one test only once, we no longer have an estimate of stability, and we also no longer have an estimate of reliability that is based on correlation. Instead, we have an estimate of what is referred to as the internal consistency of the measurement. This is based on the relationships among the test items themselves, which we treat as miniature alternate forms of the test. The resulting reliability estimate is impacted by error that comes from the items themselves being unstable estimates of the construct of interest.

Table 5.2: Four Main Reliability Study Designs

               1 Form                  2 Forms
1 Occasion     Internal Consistency    Equivalence
2 Occasions    Stability               Equivalence & Stability

Internal consistency reliability is estimated using either coefficient alpha or split-half reliability. All the remaining cells in Table 5.2 involve estimates of reliability that are based on correlation coefficients.

Table 5.2 contains four commonly used reliability study designs. There are others, including designs based on more than two forms or more than two occasions, and designs involving scores from raters, discussed below.
