What type of validity is generalizable?

Generalizability Theory

Richard J. Shavelson, Noreen M. Webb, in Encyclopedia of Social Measurement, 2005

Generalizability Coefficient

The generalizability coefficient is analogous to classical test theory's reliability coefficient (the ratio of the universe-score variance to the expected observed-score variance; an intraclass correlation). For relative decisions and a p × I × O random-effects design, the generalizability coefficient is:

(7) $E\rho^2(X_{pIO},\mu_p) = E\rho^2 = \dfrac{E_p(\mu_p-\mu)^2}{E_O E_I E_p(X_{pIO}-\mu_{IO})^2} = \dfrac{\sigma_p^2}{\sigma_p^2+\sigma_\delta^2}$
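
As a rough illustration of how Eq. (7) is used in practice (a sketch, not from the chapter), the coefficient can be computed from estimated variance components for a fully crossed p × I × O design; the component values and facet sample sizes below are hypothetical.

```python
# Hypothetical sketch: generalizability coefficient for a crossed p x I x O
# random-effects design, given (assumed) variance-component estimates.
n_i, n_o = 10, 2                      # numbers of items and occasions generalized over

var_p = 0.50                          # universe-score (person) variance
var_pi, var_po, var_pio_e = 0.20, 0.10, 0.30   # person-by-facet interactions + residual

# Relative error variance: only interactions involving persons contribute.
var_delta = var_pi / n_i + var_po / n_o + var_pio_e / (n_i * n_o)

g_coef = var_p / (var_p + var_delta)  # Eq. (7): sigma_p^2 / (sigma_p^2 + sigma_delta^2)
print(round(g_coef, 3))               # 0.855 for these made-up values
```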

Read full chapter

URL: https://www.sciencedirect.com/science/article/pii/B0123693985001936

External Validity

G.E. Matt, ... M. Sklar, in International Encyclopedia of Education (Third Edition), 2010

External validity refers to the generalizability of an association with respect to four universes: experimental units (U), treatments (T), outcomes (O), and settings (S). Three main generalizability questions are distinguished: (1) To what universes can an association be generalized based on the manifest instances present in a particular study (i.e., representativeness)? (2) Across which universes (or subuniverses) can an association be generalized (i.e., robustness and moderating conditions)? (3) To what yet-unstudied universes can an association be generalized (i.e., extrapolation)? The limitations of statistical sampling theory to answer important generalizability questions are discussed and pragmatic alternative approaches are introduced – including purposive sampling, analog models, technology transfer models, meta-analysis, and Cook’s five pragmatic principles.

Read full chapter

URL: https://www.sciencedirect.com/science/article/pii/B9780080448947017000

Evidence-Based Medicine

Robyn Bluhm, Kirstin Borgerson, in Philosophy of Medicine, 2011

3.2 General Evidence and Individualized Care

The roots of medical practice are captured in the famous saying attributed to Hippocrates, “It is more important to know what kind of person has the disease than what kind of disease the person has.” In contrast with this highly personalized philosophy of medicine, EBM suggests that medical practice should be based on the generalized results of clinical research. According to the hierarchy, for instance, meta-analyses of RCTs provide excellent evidence, and case studies provide the worst. In other words, larger, more abstract general studies provide better evidence, and thus better guides to practice, than do carefully developed, detailed individual studies. The hierarchy is oriented from the general to the specific. As it turns out, however, 1) it isn't easy to defend the claim that certain research designs are more generalizable than others, and 2) this generalizability, if achieved, may actually be a liability in practice.

Let's start with the standard argument for the generalizability of the highest ranked research methods. Because the RCT is performed on a group of individuals (often quite a large group) and collects average patient data across that group, the average patient data obtained from an RCT is more likely to be generalizable to members of the general population than the data obtained from one particular individual in a case study. For all we know, the individual described in a case study may have been an exception to the norm. (In fact, case studies are usually published to illustrate an unusual case or approach to treatment.) This generalizability is thought to extend even further in the case of meta-analyses of RCTs. In an RCT, the exceptional and the usual are averaged to produce one result: the average efficacy of a particular therapeutic intervention.4 And in a meta-analysis, the results of a variety of RCTs are averaged again. Proponents of EBM offer the following advice about applying the results of such research to practice,

[A]sk whether there is some compelling reason why the results should not be applied to the patient. A compelling reason usually won't be found, and most often you can generalize the results to your patient with confidence. [Guyatt and Rennie, 2001, p. 71]

There is a great deal of confidence in the strength of the connection between the results of research and the demands of clinical practice. However, these claims to generalizability are not as straightforward as they seem.

Subjects in RCTs are not randomly sampled from the general population. In fact, the subjects enrolled into clinical trials are often quite unrepresentative of the members of the general population or even the target population for a treatment. While it is sometimes possible to design a trial in which the trial population is purposely matched to the target population for the treatment (at least, as closely as possible), it is uncommon for a number of reasons. First, members of the target population often have a number of comorbidities (concurrent conditions).5 It is more difficult to isolate a cause-effect relationship when many other variables (including other medications) are added into the equation. These sorts of comorbidities are, accordingly, seen as a liability for the trial. Researchers often include in their studies only participants who have no other conditions and who are taking no other medications.

Second, members of the target population are often elderly and likely to be put on a treatment for long periods of time or even indefinitely. So the target population would be long-time users of the treatment. However, RCTs are usually quite short in duration (ranging from a few weeks to a few months). There is a gap between the short-term data produced by RCTs and the long-term use by the target population. The data from RCTs, then, are not easily generalized to the standard patient.

Third, it is common for researchers to select their study population in a way that allows them to test the healthiest people. The selection of younger subjects means that, because in general these subjects are healthier than older people, they are less likely to have adverse reactions to the drug. The healthier a person is, the less pronounced the adverse reaction (in general) and the more positive the overall reaction. Again, this means the trial population is different from the target population.

Fourth, research trials are often conducted in contexts that differ in significant ways from the contexts of general practice. As a result, therapy may be administered and monitored in a way that cannot be easily replicated. Research is often conducted at facilities with more resources than is standard, and implementation of treatments in contexts where resources are scarce is not straightforward. This contributes to a gap between individual patients and the promises of benefit attributed to particular treatments.

Even if we were to set aside these concerns about the generalizability of research from RCTs, we would still be left with concerns about the gap between generalized results and individual patients. This gap between the results of generalized clinical research evidence and complex individual patient care has been raised time and again in the medical literature. Feinstein and Horwitz remind us that, “When transferred to clinical medicine from an origin in agricultural research, randomized trials were not intended to answer questions about the treatment of individual patients” [Feinstein and Horwitz, 1997, p. 531]. Patients are idiosyncratic individuals dealing with multiple complex problems. The most scientific medical research, by EBM standards, produces average generalizations across homogeneous populations that, as discussed, often fail to resemble the patient because of inclusion and exclusion criteria. The highest ranked clinical research focuses on large-scale studies designed to determine simple causal relationships between treatments and their effects. In many cases, a particular patient would never have qualified for the study because he or she has comorbidities, or does not meet certain age or gender specifications [Van Spall et al., 2007]. As Tonelli points out:

Clinical research, as currently envisioned, must inevitably ignore what may be important, yet non-quantifiable, differences between individuals. Defining medical knowledge solely on the basis of such studies, then, would necessarily eliminate the importance of individual variation from the practice of medicine. [1998, p.1237]

If we pay attention to average outcomes, we may lose sight of significant variation within the trial population. It may be that some subjects responded well, while others responded poorly. Though on average a drug may be more effective than the comparison intervention used in an RCT, this isn't a full characterization of what has occurred. In fact, important subgroup data will likely be missed. The applicability of scientific evidence, especially large-scale, single-factor studies, “depends on the individual being conformable to the group in all relevant aspects,” which is rarely the case [Black, 1998, p.1]. On the basis of these concerns, we suggest that simply binding medical practice to medical research fails to capture the importance, difficulty, and skills required for re-contextualizing and individualizing medical knowledge. This is not to say that medical practice won't benefit from thoughtful and critical use of a wide variety of results from medical research, but that is not what has been proposed. The concern raised here is directed not at a general claim that medical practice should pay attention to medical research (which critics of EBM support), but at the specific rules tying medical practice to medical research in EBM, which fail to capture important distinct elements of medical practice.

Read full chapter

URL: https://www.sciencedirect.com/science/article/pii/B9780444517876500088

Non-parametric procedures

K. Nichols, A. Holmes, in Statistical Parametric Mapping, 2007

Generalizability

Questions often arise about the scope of inference, or generalizability, of non-parametric procedures. For parametric tests, when a collection of subjects has been randomly selected from a population of interest and intersubject variability is considered, the inference is on the sampled population and not just the sampled subjects. The randomization test, in contrast, only makes inference on the data at hand: a randomization test regards the data as fixed and uses the randomness of the experimental design to justify exchangeability. A permutation test, while operationally identical to the randomization test, can make inference on a sampled population: a permutation test also regards the data as fixed, but it additionally assumes the presence of a population distribution to justify exchangeability, and hence can be used for population inference. The randomization test is truly assumption free, but has a limited scope of inference.

In practice, since subjects rarely constitute a random sample of the population of interest, we find the issue of little practical concern. Scientists routinely generalize results, integrating prior experience, other findings, existing theories, and common sense in a way that a simple hypothesis test does not admit.
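
To make the mechanics concrete, here is a minimal sketch of a two-sample permutation test on a difference in means (not from the chapter); the data are simulated, and exchangeability of the labels under the null hypothesis is assumed.

```python
# Minimal sketch of a two-sample permutation test (simulated, hypothetical data).
import numpy as np

rng = np.random.default_rng(0)
group_a = rng.normal(0.0, 1.0, size=12)   # e.g., control subjects
group_b = rng.normal(0.8, 1.0, size=12)   # e.g., treated subjects

observed = group_b.mean() - group_a.mean()
pooled = np.concatenate([group_a, group_b])

n_perm = 10_000
count = 0
for _ in range(n_perm):
    permuted = rng.permutation(pooled)     # relabel subjects at random
    diff = permuted[len(group_a):].mean() - permuted[:len(group_a)].mean()
    if abs(diff) >= abs(observed):         # two-sided comparison
        count += 1

p_value = (count + 1) / (n_perm + 1)       # include the observed labelling itself
print(p_value)
```

Whether this p-value licenses inference about a wider population or only about these 24 observations is exactly the randomization-versus-permutation distinction drawn above.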

Read full chapter

URL: https://www.sciencedirect.com/science/article/pii/B9780123725608500218

Language Acquisition

Allyssa McCabe, in Encyclopedia of Social Measurement, 2005

Naturalistic Observation versus Elicitation

Related to the trade-off between experimental control and generalizability is that between naturalistic observation and elicitation. One of the earliest means used to study language acquisition was a diary of a child's progress, kept by parents who were also often linguists. Such an approach can yield ample, rich data that are true of real situations because they were derived from such situations. However, observation of what a child does, even day in and day out, does not necessarily tell us about a child's capability. Elicitation procedures are best suited to informing us of children's capacity, and by far the best-known such procedure is the wug test developed by Berko Gleason in 1958. Berko Gleason showed children a picture of a strange creature and said, “This is a wug.” She then showed children a picture of two of the strange creatures, saying, “Now there is another one. There are two of them. There are two ____.” Using this procedure, Berko Gleason was able to demonstrate that children were capable of producing grammatical morphemes (e.g., saying, “wugs”) in response to nonsense words they had never heard of before. Children had apparently implicitly acquired certain rules (e.g., for forming plurals) rather than simply mindlessly imitating their parents' productions.

Read full chapter

URL: https://www.sciencedirect.com/science/article/pii/B0123693985005363

Cognitive Sciences

M. Gopnik, in Encyclopedia of Physical Science and Technology (Third Edition), 2003

V.F.2 Species Specificity

The second question, about species specificity, has consequences for the generalizability of data about one animal to any other. If there are emergent properties of mind that make the cognitive systems of one species radically different from the cognitive systems of another species, each species must be said to live in its own special cognitive world. Evidence about cognitive systems in one kind of organism is absolutely irrelevant to discussions of processes in another type of organism, even if the systems seem to share characteristics. Although the problem of cross-species comparison is true of all organisms, the real heat of the issue is generated when we try to go from animals, no matter what their kind, to people. Are humans really different in fundamental ways from other creatures, or are they just fancier versions of the same thing? There is no one involved in this debate who does not subscribe to the evolutionary hypothesis. Our lungs and hearts developed from some simpler lungs and hearts. But what about our minds? We know there have been moments in evolution when properties emerged that were different in kind from previous properties. We now would like to know whether certain cognitive processes are the result of such emergent properties or whether all cognitive processes can be derived from simpler processes.

The prime candidate for an emergent cognitive system is language. All normal human babies acquire language at about the same time and in about the same way. Other animals do not. It is true that animals have communication systems that can be intricate, but, it is argued, they differ from human language in several fundamental ways. First, they are very constrained in what they can talk about. For example, while the language of bees can effectively communicate about the location and the distance of the food source, it does not and cannot talk about the scenery along the way, or the likely predators to be encountered, or the similarity of this field to the one visited last week. These systems are not creative in the sense that they do not combine elements in new ways to produce entirely new utterances. Language can do this. In general, the formal properties which we find in all natural languages are absent from animal communication systems. But, second, and more important, humans can use language intentionally. Not only can language be used to inform, it can also be used to deceive. We can choose to speak or to remain silent. Animals are not free to communicate or not. In animals when the conditions for the system to be activated are present, the communication must take place. Bees returning to the hive do their dance no matter what. Speakers not only can choose to communicate, they know that they can choose to do so and that the hearer knows that they intend to communicate. The speaker knows that the hearer knows that the speaker intends to communicate, and so on. Humans do not just have intentions, they are conscious of their own intentions and can reflect on them. Even if we want to say that some actions in some creatures seem to be intentional, we cannot find any behavior indicating this level of reflection. But even if animals do not naturally acquire language, they might still have untapped cognitive capacities to learn such systems. Several experiments designed to teach language to apes have been mounted. The results have been mixed. The creatures have shown themselves capable of learning to use symbols to refer to objects and actions. They have been able to master simple rules for ordering these symbols. They have used these symbols not only in training situations, but also in novel situations to communicate with other creatures. In no sense, though, have they achieved the formal complexity or the levels of intentionality we see in the language of children. Given these fundamental failures, it is not clear that their accomplishment should be considered to be rudimentary language.

These theoretical considerations that argue for language being an emergent property have recently been supplemented by empirical data which indicate that certain properties of language may be controlled by an autosomally dominant gene. Several recent studies of the pattern of occurrence of specific developmental language impairments in twins and in families have indicated that these impairments are likely to be associated with an autosomally dominant gene. The language of these subjects has been studied and it can be shown that the deficit is in a specific part of their underlying grammar. Therefore, both theoretical considerations and empirical data seem to indicate that at least certain parts of language may be a species-specific attribute of humans.

When we look at the other cognitive domains, the answer to the question is less clear. Animals do have visual systems, auditory systems, memories, and some kinds of inferencing. We know that these systems may differ from one another in their basic organizing principles. Whether these differences are best thought of as creating incommensurate systems or as being continuous with one another is still a vexed question. The answer awaits the discovery of some unifying theory or the proof that none exists.

Read full chapter

URL: https://www.sciencedirect.com/science/article/pii/B0122274105001174

Learning Strategies

C.E. Weinstein, ... T.W. Acee, in International Encyclopedia of Education (Third Edition), 2010

Domain-Independent Strategies Versus Domain-Dependent Strategies

In current theory, research, and practice, the applicability, or generalizability, of particular learning strategies to different learning content areas or tasks is still being debated. The general issue is whether it is best for students to learn domain (content or task)-specific strategies (e.g., strategies for solving a particular type of physics problem or learning a new vocabulary term in a foreign language) or more generalizable, or domain (content or task)-independent strategies that can be applied to many content areas (e.g., how to approach an unfamiliar textbook or using self-testing to check your understanding of what you are learning). In fact, if you think about it in terms of a generalization gradient, they are really just different points on the line. If the strategy has a narrower domain of applicability (i.e., it can only be used for a relatively small number of learning or performance activities), then it is domain dependent. If, on the other hand, it can be used in a wide variety of situations or content areas, then it has a wide domain of applicability and is domain independent. Like many controversies, it appears that it takes a bit of both to help students become self-regulating, strategic learners. Some strategies may be more effective and efficient for the content and tasks in one particular academic area, while others may be helpful for a wider variety of academic areas and tasks.

Read full chapter

URL: https://www.sciencedirect.com/science/article/pii/B9780080448947004978

Machine Learning

Peter Wittek, in Quantum Machine Learning, 2014

2.5 Model Complexity

The complexity of the class of functions performing classification or regression and the algorithm’s generalizability are related. The Vapnik-Chervonenkis (VC) theory provides a general measure of complexity and proves bounds on errors as a function of complexity. Structural risk minimization is the minimization of these bounds, which depend on the empirical risk and the capacity of the function class (Vapnik, 1995).

Consider a function f with a parameter vector θ: it shatters a set of data points {x1, x2, …, xN} if, for all assignments of labels to those points, there exists a θ such that the function f makes no errors when evaluating that set of data points. A set of N points can be labeled in 2^N ways. A rich function class is able to realize all 2^N separations—that is, it shatters the N points.

The idea of VC dimensions lies at the core of the structural risk minimization theory: it measures the complexity of a class of functions. This is in stark contrast to the measures of generalization performance in Section 2.4, which are derived from the sample and the distribution.

The VC dimension of a function f is the maximum number of points that are shattered by f. In other words, the VC dimension of the function f is h′, where h′ is the maximum h such that some data point set of cardinality h can be shattered by f. The VC dimension can be infinity (Figure 2.4).

Figure 2.4. Examples of shattering sets of points. (a) A line on a plane can shatter a set of three points with arbitrary labels, but it cannot shatter certain sets of four points; hence, a line has a VC dimension of three. (b) A sine function can shatter any number of points with any assignment of labels; hence, its VC dimension is infinite.
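
The shattering argument in panel (a) can be checked by brute force. The sketch below (an illustration under assumptions, not code from the book) tests every labelling of a small planar point set for strict linear separability using a feasibility linear program, confirming that a line shatters a generic triple but not the four corners of a square.

```python
# Brute-force shattering check for a line (2-D linear classifier w.x + b).
import itertools
import numpy as np
from scipy.optimize import linprog

def separable(points, labels):
    """True if some (w, b) satisfies labels_i * (w.x_i + b) >= 1 for all i."""
    X = np.asarray(points, dtype=float)
    y = np.asarray(labels, dtype=float)
    # Variables: w1, w2, b.  Feasibility LP: -y_i * (w.x_i + b) <= -1.
    A_ub = -y[:, None] * np.hstack([X, np.ones((len(X), 1))])
    b_ub = -np.ones(len(X))
    res = linprog(c=np.zeros(3), A_ub=A_ub, b_ub=b_ub,
                  bounds=[(None, None)] * 3, method="highs")
    return res.success

def shattered(points):
    """True if every +1/-1 labelling of the points is linearly separable."""
    return all(separable(points, labels)
               for labels in itertools.product([-1.0, 1.0], repeat=len(points)))

print(shattered([(0, 0), (1, 0), (0, 1)]))          # True: three generic points
print(shattered([(0, 0), (1, 0), (0, 1), (1, 1)]))  # False: the XOR labelling fails
```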

Vapnik’s theorem proves a connection between the VC dimension, empirical risk, and the generalization performance (Vapnik and Chervonenkis, 1971). For data drawn independently and identically distributed from the same distribution as the training set, the probability that the test error stays within the following upper bound is given by

(2.15) $P\left(E_N(f) \le E + \sqrt{\dfrac{h\left[\log(2n/h)+1\right]-\log(\eta/4)}{n}}\right) = 1-\eta$

if h ≪ n, where h is the VC dimension of the function. At the same time, the function class should be large enough to provide functions that are able to model the hidden dependencies in the joint distribution P(x, y).

This theorem formally binds model complexity and generalization performance. Empirical risk minimization—introduced in Section 2.4—allows us to pick an optimal model given a fixed VC dimension h for the function class. The principle that derives from Vapnik’s theorem—structural risk minimization—goes further. We optimize empirical risk for a nested sequence of increasingly complex models with VC dimensions h1 < h2 < ⋯, and select the model with the smallest value of the upper bound in Equation 2.15.
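
As a rough numeric sketch of structural risk minimization (illustrative values only, not from the book), the bound in Equation 2.15 can be evaluated for a nested sequence of model classes and the class with the smallest bound selected.

```python
# Structural risk minimization sketch with made-up empirical risks and VC dimensions.
import numpy as np

def vc_bound(emp_risk, h, n, eta=0.05):
    """Empirical risk plus the VC confidence term of Eq. (2.15); assumes h << n."""
    return emp_risk + np.sqrt((h * (np.log(2 * n / h) + 1) - np.log(eta / 4)) / n)

n = 10_000                                  # training-set size
models = [(0.40, 10), (0.15, 100), (0.10, 500), (0.09, 2_000)]   # (risk, VC dim)

bounds = [vc_bound(risk, h, n) for risk, h in models]
best = int(np.argmin(bounds))
print(best, [round(b, 3) for b in bounds])  # the second class wins here
```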

The VC dimension is a one-number summary of the learning capacity of a class of functions, which may prove crude for certain classes (Schölkopf and Smola, 2001, p. 9). Moreover, the VC dimension is often difficult to calculate. Structural risk minimization successfully applies in some cases, such as in support vector machines (Chapter 7).

A concept related to VC dimension is probably approximately correct (PAC) learning (Valiant, 1984). PAC learning stems from a different background: it introduces computational complexity to learning theory. Yet, the core principle is common. Given a finite sample, a learner has to choose a function from a given class such that, with high probability, the selected function will have low generalization error. A set of labels yi are PAC-learnable if there is an algorithm that can approximate the labels with a predefined error 0 < ϵ < 1/2 with a probability at least 1 − δ, where 0 < δ < 1/2 is also predefined. A problem is efficiently PAC-learnable if it is PAC-learnable by an algorithm that runs in time polynomial in 1/ϵ, 1/δ, and the dimension d of the instances. Under some regularity conditions, a problem is PAC-learnable if and only if its VC dimension is finite (Blumer et al., 1989).
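
For intuition about the roles of ϵ and δ, the classical sample-complexity bound for a finite hypothesis class in the realizable setting can be evaluated directly; this standard bound is an assumption of the sketch below and is not taken from the chapter.

```python
# Standard PAC sample-complexity bound for a finite hypothesis class H:
# m >= (1/epsilon) * (ln|H| + ln(1/delta)) examples suffice so that, with
# probability at least 1 - delta, any consistent hypothesis has error <= epsilon.
import math

def pac_sample_size(h_size, epsilon, delta):
    return math.ceil((math.log(h_size) + math.log(1.0 / delta)) / epsilon)

print(pac_sample_size(h_size=2**20, epsilon=0.05, delta=0.01))  # about 370 examples
```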

An early result in quantum learning theory proved that all PAC-learnable function classes are learnable by a quantum model (Servedio and Gortler, 2001); in this sense, quantum and classical PAC learning are equivalent. The lower bound on the number of examples required for quantum PAC learning is close to the classical bound (Atici and Servedio, 2005). Certain classes of functions with noisy labels that are classically not PAC-learnable can be learned by a quantum model (Bshouty and Jackson, 1995). If we restrict our attention to transductive learning problems, and we do not want to generalize to a function that would apply to an arbitrary number of new instances, we can explicitly define a class of problems that would take an exponential amount of time to solve classically, but a quantum algorithm could learn it in polynomial time (Gavinsky, 2012). This approach does not fall in the bounded error quantum polynomial time class of decision problems, to which most known quantum algorithms belong (see Section 4.6).

The connection between PAC-learning theory and machine learning is indirect, but explicit connection has been made to some learning algorithms, including neural networks (Haussler, 1992). This already suggests that quantum machine learning algorithms learn with a higher precision, even in the presence of noise. We give more specific details in Chapters 11 and 14. Here we point out that we do not deal with the exact identification of a function (Angluin, 1988), which also has various quantum formulations and accompanying literature.

Irrespective of how we optimize the learning function, there is no free lunch: there cannot be a class of functions that is optimal for all learning problems (Wolpert and Macready, 1997). For any optimization or search algorithm, better performance in one class of problems is balanced by poorer performance in another class. For this reason alone, it is worth looking into combining different learning models.

Read full chapter

URL: https://www.sciencedirect.com/science/article/pii/B9780128009536000025

Generalizability Theory

R.L. Brennan, in International Encyclopedia of Education (Third Edition), 2010

Coefficients

Two types of reliability-like coefficients are widely used in generalizability theory. One coefficient is called a generalizability coefficient and denoted Eρ2. The other coefficient is an index of dependability and is denoted Φ.

Generalizability coefficient, Eρ2. The generalizability coefficient is the ratio of the universe score variance to itself plus relative error variance:

[10] $E\rho^2 = \dfrac{\sigma^2(\tau)}{\sigma^2(\tau)+\sigma^2(\delta)}$

It is the analog of a reliability coefficient in classical theory. For the Writing assessment with n′t = 6 and n′r = 2,

$E\hat{\rho}^2 = \dfrac{0.691}{0.691+0.069} = 0.91$

The right-hand panel of Figure 1 provides a graph of $E\hat{\rho}^2$ for n′t ranging from 1 to 12 and for n′r ranging from 1 to 3. As observed in the discussion of SEMs, little is gained by having more than two raters or more than 10 prompts.

Index of dependability, Φ. An index of dependability is the ratio of universe score variance to itself plus absolute error variance:

[11] $\Phi = \dfrac{\sigma^2(\tau)}{\sigma^2(\tau)+\sigma^2(\Delta)}$

Φ differs from Eρ2 in that Φ involves σ2(Δ), whereas Eρ2 involves σ2(δ). Since σ2(Δ) is generally larger than σ2(δ), it follows that Φ is generally smaller than Eρ2. The index Φ is appropriate when scores are given absolute interpretations, as in domain-referenced or criterion-referenced situations. For the Writing assessment with n′t = 6 and n′r = 2,

$\hat{\Phi} = \dfrac{0.691}{0.691+0.079} = 0.90$
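
As a quick check (a sketch using the chapter's reported values, not code from the chapter), both coefficients follow directly from the variance components:

```python
# Reproduce the Writing-assessment coefficients from the reported components.
var_universe = 0.691   # sigma^2(tau), universe-score variance
var_relative = 0.069   # sigma^2(delta), relative error variance
var_absolute = 0.079   # sigma^2(Delta), absolute error variance

g_coefficient = var_universe / (var_universe + var_relative)   # Eq. [10]
dependability = var_universe / (var_universe + var_absolute)   # Eq. [11]

print(f"{g_coefficient:.2f} {dependability:.2f}")              # 0.91 0.90
```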

Read full chapter

URL: https://www.sciencedirect.com/science/article/pii/B9780080448947002463

Qualitative Analysis, Anthropology

D. Jean Clandinin, in Encyclopedia of Social Measurement, 2005

Criteria for Assessing Qualitative Research Texts

Although it is generally agreed among qualitative researchers that the criteria for judging qualitative research are not validity, reliability, and generalizability in the ways those terms are understood in quantitative methodologies, the criteria for judging qualitative research are still under development. Triangulation, member checking, and audit trails that allow external researchers to reconstruct the research process are used in some qualitative methodologies. Criteria such as plausibility, persuasiveness, authenticity, and verisimilitude are under consideration. Resonance with the experience of readers is another criterion currently in use as a way to judge the quality of research.

Read full chapter

URL: https://www.sciencedirect.com/science/article/pii/B0123693985000852

How does external validity relate to generalizability of results?

External validity is the extent to which you can generalize the findings of a study to other situations, people, settings and measures. In other words, can you apply the findings of your study to a broader context? The aim of scientific research is to produce generalizable knowledge about the real world.

What are the four types of validity?

Construct validity
Content validity
Face validity
Criterion validity

What makes a test generalizable?

If the results of a study are broadly applicable to many different types of people or situations, the study is said to have good generalizability. If the results can only be applied to a very narrow population or in a very specific situation, the results have poor generalizability.

Is Generalisability the same as validity?

Generalisability describes the extent to which research findings can be applied to settings other than that in which they were originally tested. A study is externally valid if it describes the true state of affairs outside its own setting.