Formal Learning Theory

First published Sat Feb 2, 2002; substantive revision Wed Mar 23, 2022

Formal learning theory is the mathematical embodiment of a normative epistemology. It deals with the question of how an agent should use observations about her environment to arrive at correct and informative conclusions. Philosophers such as Putnam, Glymour and Kelly have developed learning theory as a normative framework for scientific reasoning and inductive inference.

Terminology. Cognitive science and related fields typically use the term “learning” for the process of gaining information through observation— hence the name “learning theory”. To most cognitive scientists, the term “learning theory” suggests the empirical study of human and animal learning stemming from the behaviourist paradigm in psychology. The epithet “formal” distinguishes the subject of this entry from behaviourist learning theory. Philosophical terms for learning-theoretic epistemology include “logical reliability” (Kelly [1996], Glymour [1991]) and “means-ends epistemology” (Schulte [1999]).

Because many developments in, and applications of, formal learning theory come from computer science, the term “computational learning theory” is also common. Many results on learning theory in computer science are concerned with Valiant’s and Vapnik’s notion of learning generalizations that are probably approximately correct (PAC learning) (Valiant [1984]). This notion of empirical success was introduced to philosophers by Gilbert Harman in his Nicod lectures, and elaborated in a subsequent book [2007]. Valiant himself provides an accessible account of PAC learning and its relationship to the problem of induction in a recent book (Valiant [2013, Ch. 5]). The present article describes a nonstatistical tradition of learning theory stemming from the seminal work of Hilary Putnam [1963] and E. Mark Gold [1967]. Recent research has extended the reliabilist means-ends approach to a statistical setting where inductive methods assign probabilities to statistical hypotheses from random samples. The new statistical framework is described at the end of this entry, in the Section on Reliable Statistical Inquiry.

Philosophical characteristics. In contrast to other philosophical approaches to inductive inference, learning theory does not aim to describe a universal inductive method or explicate general axioms of inductive rationality. Rather, learning theory pursues a context-dependent means-ends analysis [Steel 2010]: For a given empirical problem and a set of cognitive goals, what is the best method for achieving the goals? Most of learning theory examines which investigative strategies reliably and efficiently lead to correct beliefs about the world.

Article Overview. Compared to traditional philosophical discussions of inductive inference, learning theory provides a radically new way of thinking about induction and scientific method. The main aim of this article is to explain the main concepts of the theory through examples. Running examples are repeated throughout the entry; at the same time, the sections are meant to be as independent of each other as possible. We use the examples to illustrate some theorems of philosophical interest, and to highlight the key philosophical ideas and insights behind learning theory.

Readers interested in the mathematical substance of learning theory will find some references in the Bibliography, and a summary of the basic definitions in the Supplementary Document. A text by Jain et al. [1999] collects many of the main definitions and theorems. New results appear in proceedings of annual conferences, such as the Conferences on Learning Theory (COLT) and Algorithmic Learning Theory (ALT). The philosophical issues and motivation pertaining to learning-theoretic epistemology are discussed extensively in the works of philosophers such as Putnam, Glymour and Kelly (Putnam [1963], Glymour [1991], Glymour and Kelly [1992], Kelly [1996]).

1. Convergence to the Truth and Nothing But the Truth

Learning-theoretic analysis assesses dispositions for forming beliefs. Several terms for belief acquisition processes are in common use in philosophy; I will use “inductive strategy”, “inference method” and most frequently “inductive method” to mean the same thing. The best way to understand how learning theory evaluates inductive methods is to work through some examples. The following presentation begins with some very simple inductive problems and moves on to more complicated and more realistic settings.

1.1 Simple Universal Generalization

Let’s revisit the classic question of whether all ravens are black. Imagine an ornithologist who tackles this problem by examining one raven after another. There is exactly one observation sequence in which only black ravens are found; all others feature at least one nonblack raven. The figure below illustrates the possible observation sequences. Dots in the figure denote points at which an observation may be made. A black bird to the left of a dot indicates that at this stage, a black raven is observed. Similarly, a white bird to the right of a dot indicates that a nonblack raven is observed. Given a complete sequence of observations, either all observed ravens are black or some observed raven is nonblack; the figure labels complete observation sequences with the statement that is true of them. The gray fan indicates that after the observation of a white raven, the claim that not all ravens are black holds on all observation sequences resulting from further observations.

a diagram: link to extended description below

Figure 1 [An extended description of figure 1 is in a supplement.]

If the world is such that only black ravens are found, we would like the ornithologist to settle on this generalization. (It may be possible that some nonblack ravens remain forever hidden from sight, but even then the generalization “all ravens are black” at least gets the observations right.) If the world is such that eventually a nonblack raven is found, then we would like the ornithologist to arrive at the conclusion that not all ravens are black. This specifies a set of goals of inquiry. For any given inductive method that might represent the ornithologist’s disposition to adopt conjectures in the light of the evidence, we can ask whether that method measures up to these goals or not. There are infinitely many possible methods to consider; we’ll look at just two, a skeptical one and one that boldly generalizes. The bold method conjectures that all ravens are black after seeing that the first raven is black. It hangs on to this conjecture unless some nonblack raven appears. The skeptical method does not go beyond what is entailed by the evidence. So if a nonblack raven is found, the skeptical method concludes that not all ravens are black, but otherwise the method does not make a conjecture one way or another. The figure below illustrates both the bold and the skeptical method.

two part diagram: link to extended description below

Figure 2 [An extended description of figure 2 is in a supplement.]

Do these methods attain the goals we set out? Consider the bold method. There are two possibilities: either all observed ravens are black, or some nonblack raven is found. In the first case, the method conjectures that all ravens are black and never abandons this conjecture. In the second case, the method concludes that not all ravens are black as soon as the first nonblack raven is found. Hence no matter how the evidence comes in, eventually the method gives the right answer as to whether all ravens are black and sticks with this answer. Learning theorists call such methods reliable because they settle on the right answer no matter what observations the world provides.

The skeptical method does not measure up so well. If a nonblack raven appears, then the method does arrive at the correct conclusion that not all ravens are black. But if all ravens are black, the skeptic never takes an “inductive leap” to adopt this generalization. So in that case, the skeptic fails to provide the right answer to the question of whether all ravens are black.
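
To make the two dispositions concrete, here is a minimal sketch in Python (the string encoding of observations and the function names are illustrative assumptions, not part of the formal theory):

```python
def bold(evidence):
    """Boldly generalize: conjecture 'all ravens are black' on black-only
    evidence; retract only when a nonblack raven appears."""
    if "nonblack" in evidence:
        return "not all ravens are black"
    return "all ravens are black"

def skeptic(evidence):
    """Never go beyond what the evidence entails."""
    if "nonblack" in evidence:
        return "not all ravens are black"
    return "no conjecture"

# On a data stream of only black ravens, the bold method settles on the
# right answer immediately; the skeptic never answers the question.
stream = ["black"] * 5
for n in range(1, len(stream) + 1):
    print(n, bold(stream[:n]), "|", skeptic(stream[:n]))
```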

This illustrates how means-ends analysis can evaluate methods: the bold method meets the goal of reliably arriving at the right answer, whereas the skeptical method does not. Note the character of this argument against the skeptic: The problem, in this view, is not that the skeptic violates some canon of rationality, or fails to appreciate the “uniformity of nature”. The learning-theoretic analysis concedes to the skeptic that no matter how many black ravens have been observed in the past, the next one could be white. The issue is that if all observed ravens are indeed black, then the skeptic never answers the question “are all ravens black?”. Getting the right answer to that question requires generalizing from the evidence even though the generalization could be wrong.

As for the bold method, it’s important to be clear on what it does and does not achieve. The method will eventually settle on the right answer—but it (or we) may never be certain that it has done so. As William James put it, “no bell tolls” when science has found the right answer. We are certain that the method will eventually settle on the right answer; but we may never be certain that the current answer is the right one. This is a subtle point; the next example illustrates it further.

1.2 The New Riddle of Induction

Nelson Goodman posed a famous puzzle about inductive inference known as the (New) Riddle of Induction (Goodman 1983). Our next example is inspired by his puzzle. Goodman considered generalizations about emeralds, involving the familiar colours of green and blue, as well as certain unusual ones:

Suppose that all emeralds examined before a certain time \(t\) are green …. Our evidence statements assert that emerald \(a\) is green, that emerald \(b\) is green, and so on….

Now let us introduce another predicate less familiar than “green”. It is the predicate “grue” and it applies to all things examined before \(t\) just in case they are green but to other things just in case they are blue. Then at time \(t\) we have, for each evidence statement asserting that a given emerald is green, a parallel evidence statement asserting that emerald is grue. The question is whether we should conjecture that all emeralds are green rather than that all emeralds are grue when we obtain a sample of green emeralds examined before time \(t\), and if so, why.

Clearly we have a family of grue predicates in this problem, one for each different “critical time” \(t\); let’s write grue\((t)\) to denote these grue predicates. Following Goodman, let us refer to methods as projection rules in discussing this example. A projection rule succeeds in a world just in case it settles on a generalization that is correct in that world. Thus in a world in which all examined emeralds are found to be green, we want our projection rule to converge to the proposition that all emeralds are green. If all examined emeralds are grue\((t)\), we want our projection rule to converge to the proposition that all emeralds are grue\((t)\). Note that this stipulation treats green and grue predicates completely on a par, with no bias towards either. As before, let us consider two rules: the natural projection rule which conjectures that all emeralds are green as long as only green emeralds are found, and the gruesome rule which keeps projecting the next grue predicate consistent with the available evidence. Expressed in the green-blue vocabulary, the gruesome projection rule conjectures that, after observing some number \(n\) of green emeralds, all future ones will be blue. The figures below illustrate the possible observation sequences, the natural projection rule, and the gruesome projection rule.

natural projection rule: link to extended description below

Figure 3 [An extended description of figure 3 is in a supplement.]

The following figure shows the gruesome projection rule.

gruesome projection rule: link to extended description below

Figure 4 [An extended description of figure 4 is in a supplement.]

How do these rules measure up to the goal of arriving at a true generalization? Suppose for the sake of the example that the only serious possibilities under consideration are that (1) all emeralds are green, or (2) all emeralds are grue\((t)\) for some critical time \(t\). Then the natural projection rule settles on the correct generalization no matter what the correct generalization is. For if all emeralds are green, the natural projection rule asserts this fact from the beginning. And suppose that all emeralds are grue\((t)\) for some critical time \(t\). Then at time \(t\), a blue emerald will be observed. At this point the natural projection rule settles on the conjecture that all emeralds are grue\((t)\), which must be correct given our assumption about the possible observation sequences. Thus no matter what evidence is obtained in the course of inquiry—consistent with our background assumptions—the natural projection rule eventually settles on a correct generalization about the colour of emeralds.

The gruesome rule does not do as well. For if all emeralds are green, the rule will never conjecture this fact because it keeps projecting grue predicates. Hence there is a possible observation sequence—namely the one on which all emeralds are green—on which the gruesome rule fails to converge to the right generalization. So means-ends analysis would recommend the natural projection rule over the gruesome rule.
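
A minimal sketch of the two projection rules, under the same illustrative encoding as before (observations coded as the strings "green" and "blue"; these conventions are assumptions for illustration):

```python
def natural(evidence):
    """Project 'all emeralds are green' until a blue emerald is found;
    then settle on the grue(t) hypothesis fixed by the data."""
    if "blue" in evidence:
        t = evidence.index("blue") + 1   # critical time of the first blue emerald
        return f"all emeralds are grue({t})"
    return "all emeralds are green"

def gruesome(evidence):
    """Always project the next grue predicate consistent with the data:
    after n green emeralds, conjecture that the rest are blue."""
    if "blue" in evidence:
        t = evidence.index("blue") + 1
        return f"all emeralds are grue({t})"
    return f"all emeralds are grue({len(evidence) + 1})"

print([natural(["green"] * n) for n in (1, 2, 3)])
# ['all emeralds are green', 'all emeralds are green', 'all emeralds are green']
print([gruesome(["green"] * n) for n in (1, 2, 3)])
# ['all emeralds are grue(2)', 'all emeralds are grue(3)', 'all emeralds are grue(4)']
```

On an all-green data stream the natural rule's conjecture is stable, while the gruesome rule's conjecture changes with every observation and never converges.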

1.3 Discussion

The means-ends analysis of the Riddle of Induction illustrates a number of philosophically important points that hold for learning-theoretic analysis in general.

  1. Equal Treatment of All Hypotheses. As in the previous example, nothing in this argument hinges on arguments to the effect that certain possibilities are not to be taken seriously a priori. In particular, nothing in the argument says that generalizations with grue predicates are ill-formed, unlawlike, or in some other way a priori inferior to “all emeralds are green”.

  2. Language Invariance. The analysis does not depend on the vocabulary in which the evidence and generalizations are framed. For ease of exposition, I have mostly used the green-blue reference frame. However, grue-bleen speakers would agree that the aim of reliably settling on a correct generalization requires the natural projection rule rather than the gruesome one, even if they would want to express the conjectures of the natural rule in their grue-bleen language rather than the green-blue language that we have used so far.

  3. Dependence on Context. Though the analysis does not depend on language, it does depend on assumptions about what the possible observation sequences are. The example as described above seems to comprise the possibilities that correspond to the colour predicates Goodman himself discussed. But means-ends analysis applies just as much to other sets of possible predicates. Schulte [1999] and Chart [2000] discuss a number of other versions of the Riddle of Induction, in some of which means-ends analysis favours projecting that all emeralds are grue on a sample of all green emeralds.

1.4 Falsificationism and Generalizations with Exceptions

Our first two examples feature simple universal generalizations. Some subtle aspects of the concept of long-run reliability, particularly its relationship to falsificationism, become apparent if we consider generalizations that allow for exceptions. To illustrate, let us consider another ornithological example. Two competing hypotheses are under investigation.

  1. All but finitely many swans are white. That is, basically all swans are white, except for a finite number of exceptions to the rule.
  2. All but finitely many swans are black. That is, basically all swans are black, except for a finite number of exceptions to the rule.

Assuming that one or the other of these hypotheses is correct, is there an inductive method that reliably settles on the right one? What makes this problem more difficult than our first two is that each hypothesis under investigation is consistent with any finite amount of evidence. If 100 white swans and 50 black swans are found, either the 50 black swans or the 100 white swans may be the exception to the rule. In terminology made familiar by Karl Popper’s work, we may say that neither hypothesis is falsifiable. As a consequence, the inductive strategy from the previous two examples will not work here. This strategy was basically to adopt a “bold” universal generalization, such as “all ravens are black” or “all emeralds are green”, and to hang on to this conjecture as long as it “passes muster”. However, when rules with possible exceptions are under investigation, this strategy is unreliable. For example, suppose that an inquirer first adopts the hypothesis that “all but finitely many swans are white”. It may be the case that from then on, only black swans are found. But each of these apparent counterinstances can be “explained away” as an exception. If the inquirer follows the principle of hanging on to her conjecture until the evidence is logically inconsistent with the conjecture, she will never abandon her false belief that all but finitely many swans are white, much less arrive at the correct belief that all but finitely many swans are black.

Reliable inquiry requires a more subtle investigative strategy. Here is one (of many). Begin inquiry with either competing hypothesis, say “all but finitely many swans are black”. Choose some cut-off ratio to represent a “clear majority”; for definiteness, let’s say 70%. If the current conjecture is that all but finitely many swans are black, change your mind to conjecture that all but finitely many swans are white just in case over 70% of observed swans are in fact white. Likewise, if the current conjecture is that all but finitely many swans are white, switch back just in case over 70% of observed swans are in fact black.

A bit of thought shows that this rule reliably identifies the correct hypothesis in the long run, no matter which of the two competing hypotheses is correct. For if all but finitely many swans are black, eventually the nonblack exceptions to the rule will be exhausted, and an arbitrarily large majority of observed swans will be black. Similarly if all but finitely many swans are white.
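
The strategy is easy to write down as a procedure. The sketch below is one way to do it; the 70% cut-off is the stipulation from the text, while the string encoding and function name are illustrative assumptions:

```python
def swan_method(observations, cutoff=0.7):
    """One reliable rule for the swans-with-exceptions problem: start with
    'mostly black', and switch whenever a clear majority (> cutoff) of
    observed swans has the other colour."""
    conjecture = "all but finitely many swans are black"
    whites = blacks = 0
    for swan in observations:             # each swan is "white" or "black"
        whites += swan == "white"
        blacks += swan == "black"
        total = whites + blacks
        if conjecture.endswith("black") and whites / total > cutoff:
            conjecture = "all but finitely many swans are white"
        elif conjecture.endswith("white") and blacks / total > cutoff:
            conjecture = "all but finitely many swans are black"
    return conjecture

# After the finitely many black exceptions are exhausted, whites dominate
# and the method switches for good.
print(swan_method(["black"] * 3 + ["white"] * 20))
# 'all but finitely many swans are white'
```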

Generalizations with exceptions illustrate the relationship between Popperian falsificationism and the learning-theoretic idea of reliable convergence to the truth. In some settings of inquiry, notably those involving universal generalizations, a naively Popperian “conjectures-and-refutations” approach of hanging on to conjectures until the evidence falsifies them does yield a reliable inductive method. In other problems, like the current example, it does not. Relying on falsifications is sometimes, but not always, the best way for inquiry to proceed. Learning theory has provided mathematical theorems that clarify the relationship between a conjectures-and-refutations approach and reliable inquiry. The details are discussed in Section 3 (The Limits of Inquiry and the Complexity of Empirical Problems). Generally speaking, methods that solve a learning problem with unfalsifiable hypotheses can be represented as employing a refined hypothesis space where the original hypotheses are replaced by strengthened hypotheses that are falsifiable.

2. Case Studies from Scientific Practice

This section provides further examples to illustrate learning-theoretic analysis. The examples in this section are more realistic and address methodological issues arising in scientific practice. They are not probabilistic; statistical hypotheses are discussed in Section 6. This entry provides an outline of the full analysis; there are references to more detailed discussions below. More case studies may be found in [Kelly 1996, Ch. 7.7; Harrell 2000]. Readers who wish to proceed to the further development of the theory and philosophy of means-ends epistemology can skip this section without loss of continuity.

2.1 Conservation Laws in Particle Physics

One of the hallmarks of elementary particle physics is the discovery of new conservation laws that apply only in the subatomic realm [Ford 1963, Ne’eman and Kirsh 1983, Feynman 1965]. (Feynman groups one of them, the conservation of baryon number, with the other “great conservation laws” of energy, charge and momentum.) Simplifying somewhat, conservation principles serve to explain why certain processes involving elementary particles do not occur: the explanation is that some conservation principle was violated (cf. Omnes [1971, Ch. 2] and Ford [1963]). So a goal of particle inquiry is to find a set of conservation principles such that for every process that is possible according to the (already known) laws of physics, but fails to be observed experimentally, there is some conservation principle that rules out that process. And if a process is in fact observed to occur, then it ought to satisfy all conservation laws that we have introduced.

This constitutes an inference problem to which we may apply means-ends analysis. An inference method produces a set of conservation principles in response to reports of observed processes. Means-ends analysis asks which methods are guaranteed to settle on conservation principles that account for all observations, that is, that rule out unobserved processes and allow observed processes. Schulte [2008] describes an inductive method that accomplishes this goal. Informally the method may be described as follows.

  • Suppose we have observed a set of reactions among elementary particles.
  • Conjecture a set of conservation laws that permits the observed reactions and rules out as many unobserved reactions as possible.

The logic of conservation laws is such that the observation of some reactions entails the possibility of other unobserved ones. The learning-theoretic method rules out all reactions that are not entailed. It turns out that the conservation principles that this method would posit on the currently available evidence are empirically equivalent to the ones that physicists have introduced. Specifically, their predictions agree exactly with the conservation of charge, baryon number, muon number, tau number and lepton number that is part of the Standard Model of particle physics.
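
In the linear-algebraic reconstruction of this analysis (following Schulte [2008]; the details here are a simplifying assumption), a reaction is a vector of net particle counts, and a conservation law is a quantity assignment \(q\) with \(q \cdot r = 0\) for every permitted reaction \(r\). The maximally restrictive laws then form a basis of the space orthogonal to the observed reactions, which a few lines of Python can compute. The particle list and the single encoded reaction below are toy data, not the physicists’ full dataset:

```python
from sympy import Matrix

# Toy encoding (an assumption for illustration): a reaction is the vector
# of net particle counts, products minus reactants, over the particle types.
particles = ["p", "n", "e-", "anti-nu"]
observed_reactions = [
    [1, -1, 1, 1],   # beta decay: n -> p + e- + anti-nu
]

# A quantity q is conserved in reaction r just in case q . r = 0. The most
# restrictive set of laws therefore spans the orthogonal complement of the
# observed reactions: every reaction outside their span violates some law.
R = Matrix(observed_reactions)
for law in R.nullspace():
    print({p: q for p, q in zip(particles, law)})
# Familiar laws such as charge and baryon number are linear combinations
# of the basis vectors printed here.
```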

For some physical processes, the only way to get empirically adequate conservation principles is by positing that some hidden particles have gone undetected. Schulte [2009] extends the analysis such that an inductive method may not only introduce conservation laws, but also posit unseen particles. The basic principle is again to posit unseen particles in such a way that we rule out as many unobserved reactions as possible. When this method is applied to the known particle data, it rediscovers the existence of an electron antineutrino. This is one of the particles of key concern in current particle physics.

2.2 Causal Connections

There has been a substantial body of research on learning causal relationships as represented in a causal graph [Spirtes et al. 2000]. Kelly suggested a learning-theoretic analysis of inferring causality where the evidence is provided in the form of observed significant correlations among variables of interest (a modern version of Hume’s “constant conjunctions”). The following inductive method is guaranteed to converge to an empirically adequate causal graph as more and more correlations are observed [Schulte, Luo and Greiner 2007].

  • Suppose we have observed a set of correlations or associations among a set of variables of interest.
  • Select a causal graph that explains the observed correlations with a minimum number of direct causal links.

2.3 Models of Cognitive Architecture

Some philosophers of mind have argued that the mind is composed of fairly independent modules. Each module has its own “input” from other modules and sends “output” to other modules. For example, an “auditory analysis system” module might take as input a heard word and send a phonetic analysis to an “auditory input lexicon”. The idea of modular organization raises the empirical question of what mental modules there are and how they are linked to each other. A prominent tradition of research in cognitive neuroscience has attempted to develop a model of mental architecture along these lines by studying the responses of normal and abnormal subjects to various stimuli. The idea is to compare normal reactions with abnormal ones—often caused by brain damage—so as to draw inferences about which mental capacities depend on each other and how.

Glymour [1994] asked the reliabilist question whether there are inference methods that are guaranteed to eventually settle on a true theory of mental organization, given exhaustive evidence about normal and abnormal capacities and reactions. He argued that for some possible mental architectures, no amount of evidence of the stimulus-response kind can distinguish between them. Since the available evidence determines the conjectures of an inductive method, it follows that there is no guarantee that a method will settle on the true model of cognitive architecture. Glymour has also explored to what extent richer kinds of evidence would resolve underdetermination of mental architecture. (One example of richer evidence is a double dissociation: a pair of patients, one who has a normal capacity for understanding spoken words, but fails to understand written ones, and another who understands written words but not spoken ones.)

In further discussion, Bub [1994] showed that if we grant certain restrictive assumptions about how mental modules are connected, then a complete set of behavioural observations would allow a neuropsychologist to ascertain the module structure of a (normal) mind. In fact, under Bub’s assumptions there is a reliable method for identifying the modular structure. The high-level idea of the procedure is as follows.

  1. Every hypothesized modular structure can be identified with a graph \(G\) containing an edge \(M_1 \rightarrow M_2\) if module \(M_1\) calls on module \(M_2\).
  2. Each module graph \(G\) is consistent with a set of possible paths among modules. Say that a graph \(G\) is more constrained than another graph \(G'\) if the paths defined by \(G\) are a subset of those defined by \(G'\).
  3. Conjecture any module graph \(G\) that is maximally constrained, that is, there is no other graph \(G'\) more constrained than \(G\). (A toy sketch of steps 2 and 3 follows below.)
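
In the sketch below, the candidate graphs and module names are hypothetical, and “paths” means directed simple paths; Bub’s actual procedure works from behavioural data rather than a prepared candidate set:

```python
def paths(graph):
    """All directed simple paths in a module graph (node -> successors)."""
    found = set()
    def extend(path):
        for nxt in graph.get(path[-1], ()):
            if nxt not in path:
                found.add(path + (nxt,))
                extend(path + (nxt,))
    for node in graph:
        extend((node,))
    return found

def maximally_constrained(candidates):
    """Keep the graphs G such that no other candidate G' is strictly
    more constrained, i.e. paths(G') is a proper subset of paths(G)."""
    psets = {name: paths(g) for name, g in candidates.items()}
    return [name for name in candidates
            if not any(psets[other] < psets[name] for other in candidates)]

# Hypothetical candidates consistent with some behavioural data:
candidates = {
    "G1": {"A": ["B"], "B": ["C"], "C": []},        # A -> B -> C
    "G2": {"A": ["B", "C"], "B": ["C"], "C": []},   # adds a direct A -> C link
}
print(maximally_constrained(candidates))            # ['G1']
```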

2.4 Discussion

These studies illustrate some general features of learning theory:

  1. Generality. The basic notions of the theory are very general. Essentially, the theory applies whenever one has a question that prompts inquiry, a number of candidate answers, and some evidence for deciding among the answers. Thus means-ends analysis can be applied in any discipline aimed at empirical knowledge, for example physics or psychology.

  2. Context Dependence. Learning theory is pure normative a priori epistemology in the sense that it deals with standards for assessing methods in possible settings of inquiry. But the approach does not aim for universal, context-free methodological maxims. The methodological recommendations depend on contingent factors, such as the operative methodological norms, the questions under investigation, the background assumptions that the agent brings to inquiry, the observational means at her disposal, her cognitive capacities, and her epistemic aims. As a consequence, to evaluate specific methods in a given domain, as in the case studies mentioned, one has to study the details of the case in question. The means-ends analysis often rewards this study by pointing out what the crucial methodological features of a given scientific enterprise are, and by explaining precisely why and how these features are connected to the success of the enterprise in attaining its epistemic aims.

  3. Trade-offs. From the perspective of means-ends epistemology, inquiry involves an ongoing struggle with hard choices, rather than the execution of a universal “scientific method”. The inquirer has to balance conflicting values, and may consider various strategies such as accepting difficulties in the short run hoping to resolve them in the long run. For example, in the conservation law problem, there can be conflicts between theoretical parsimony, i.e., positing fewer conservation laws, and ontological parsimony, i.e., introducing fewer hidden particles. For another example, a particle theorist may accept positing undetected particles in the hopes that they will eventually be observed as science progresses. The search for the Higgs boson illustrates this strategy. An important learning-theoretic project is to examine when such tradeoffs arise and what the options for resolving them are. Section 4 extends learning-theoretic analysis to consider goals in addition to long-run reliability.

3. The Limits of Inquiry and the Complexity of Empirical Problems

After seeing a number of examples like the ones described above, one begins to wonder what the pattern is. What is it about an empirical question that allows inquiry to reliably arrive at the correct answer? What general insights can we gain into how reliable methods go about testing hypotheses? Learning theorists answer these questions with characterization theorems. Characterization theorems are generally of the form “it is possible to attain this standard of empirical success in a given inductive problem if and only if the inductive problem meets the following conditions”.

We first cover the case where inquiry can provide certainty as to whether an empirical hypothesis is correct (relative to background knowledge). Then we consider when and how inquiry can converge to a correct hypothesis without ever arriving at certain conclusions, as described in Section 1. We will introduce enough definitions and formal concepts to state the results precisely; the supplementary document provides a full formalization.

A learning problem is defined by a finite or countably infinite set of possible hypotheses \(\mathbf{H} = \{H_1, H_2, \ldots, H_n, \ldots\}\). These hypotheses are mutually exclusive and jointly cover all possibilities consistent with the inquirer’s background assumptions.

Examples

  • In the raven color problem of Section 1.1, there are two hypotheses \(H_1 =\) “all (observed) ravens are black” and \(H_2 =\) “some (observed) raven is not black”.
  • In the New Riddle of Induction from Section 1.2, there are infinitely many alternative hypotheses: We have \(H_{green} =\) “all (observed) emeralds are green” and countably many alternatives of the form \(H_t =\) “all (observed) emeralds are grue\((t)\)” where \(t\) is a natural number.

This section defines properties of hypotheses that determine the sense in which inquiry from observations can indicate whether they are correct. These properties are not absolute but relative to a set of alternatives \(\mathbf{H}\), any one of which may obtain for all the inquirer knows. The most basic relative property is relative entailment.

  • A hypothesis \(H\) is consistent with a finite number of observations if \(H\) is correct for some complete data sequence that extends the finite observations.
  • A finite number of observations falsifies hypothesis \(H\) if \(H\) is inconsistent with the observations.
  • A finite number of observations entails hypothesis \(H\) relative to a hypothesis set \(\mathbf{H}\) if \(H\) is the only hypothesis in \(\mathbf{H}\) consistent with the observations.

Note that since logical entailment does not depend on the language we use to frame evidence and hypotheses, the concepts of consistency, falsification, and entailment are likewise language-invariant.

Examples. Recall the raven scenario from Section 1.1 (diagram repeated for convenience).

same as figure 1: link to extended description below

Figure 1 [An extended description of figure 1 is in a supplement.]

The observation that the first raven is black is consistent with both hypotheses \(H_1 =\) “all (observed) ravens are black” and \(H_2 =\) “some (observed) raven is not black”. The observation that the first raven, or any raven, is white falsifies the hypothesis \(H_1\) and entails the hypothesis \(H_2\). The entailment is illustrated by the gray fan structure, which means that after the observation of any white raven, the hypothesis \(H_2\) is correct for any extending complete data sequence that records the color of all further observed ravens.

3.1 Verifiable and Refutable Hypotheses

The next concepts we need in order to understand which hypotheses can be settled by reliable inquiry are the notions of verifiable and refutable hypotheses. The verifiability and falsifiability of claims have been extensively discussed in epistemology and the philosophy of science, especially by philosophers concerned with issues in logical empiricism. This subsection describes how these concepts are used in learning theory, then compares the learning-theoretic concepts with discussions in broader epistemology.

  • A hypothesis \(H\) is verifiable if whenever \(H\) is correct, eventually evidence is observed that entails that \(H\) is correct. More formally: \(H\) is verifiable with respect to a hypothesis set \(\mathbf{H}\) if for every complete data sequence for which \(H\) is correct, there is a finite number of observations that falsifies all alternative hypotheses \(H'\) from \(\mathbf{H}\).
  • A hypothesis \(H\) is refutable if whenever \(H\) is incorrect, eventually evidence is observed that falsifies \(H\). More formally: \(H\) is refutable with respect to a hypothesis set \(\mathbf{H}\) if for every complete data sequence for which \(H\) is not correct (but some other hypothesis in \(\mathbf{H}\) is), there is a finite number of observations that falsifies \(H\). Both definitions are restated symbolically below.
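
Writing \(\varepsilon\) for a complete data sequence and \(\varepsilon|n\) for its first \(n\) observations (notation adopted here for compactness), the two definitions read:

\[
\begin{aligned}
H \text{ is verifiable w.r.t. } \mathbf{H} &\iff \text{for every } \varepsilon \text{ for which } H \text{ is correct, } \exists n: \varepsilon|n \text{ entails } H \text{ relative to } \mathbf{H};\\
H \text{ is refutable w.r.t. } \mathbf{H} &\iff \text{for every } \varepsilon \text{ for which } H \text{ is incorrect, } \exists n: \varepsilon|n \text{ falsifies } H.
\end{aligned}
\]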

Examples

  • The hypothesis \(H_2 =\) “some (observed) raven is not black” is verifiable but not refutable. It is verifiable because any data sequence for which it is correct features a non-black raven at some finite time. The observation of the non-black raven entails \(H_2\). The hypothesis \(H_2\) is not refutable, because if only black ravens are observed forever, then \(H_2\) is incorrect, but there is no finite number of observations that falsifies \(H_2\).
  • The hypothesis \(H_1 =\) “all (observed) ravens are black” is refutable but not verifiable. It is refutable because any data sequence for which it is not correct features a non-black raven at some finite time. The observation of the non-black raven falsifies \(H_1\). \(H_1\) is not verifiable, because if only black ravens are observed forever, then \(H_1\) is correct, but there is no finite number of observations that entails \(H_1\).
  • In the New Riddle of Induction of Section 1.2 (diagram repeated below for convenience), the hypothesis “all (observed) emeralds are green” is refutable but not verifiable, for the same reason that “all (observed) ravens are black” is refutable but not verifiable.
  • Any of the grue hypotheses \(H_t =\) “all (observed) emeralds are grue\((t)\)” is verifiable and refutable. \(H_t\) is refutable because any complete data sequence for which a gruesome generalization is incorrect will feature a counterexample that falsifies it. \(H_t\) is verifiable because if it is correct, the first observation of a blue emerald at time \(t\) falsifies the hypothesis “all (observed) emeralds are green” and, together with the preceding green observations, falsifies all the other grue hypotheses \(H_{t'}\) with \(t' \ne t\).

The example of the gruesome hypotheses shows that an empirical hypothesis can be both verifiable and refutable (sometimes called “decidable” in analogy with computation theory). Other typical examples of decidable empirical claims are singular observations, such as “the first raven is black”, and Boolean combinations of singular observations.

same as figure 4: link to extended description below

Figure 4 [An extended description of figure 4 is in a supplement.]

We briefly discuss similarities to, and differences from, related concepts in epistemology and the philosophy of science.

Verificationism is part of the philosophy of logical empiricism. The core idea is that for a claim to be meaningful, it must be empirically verifiable. The main difference with our concept is the philosophical objective: the goal of learning theory is not to separate meaningful from meaningless claims, but to characterize the standards of empirical success we can expect from inquiry for a given set of hypotheses. A hypothesis that is verifiable according to the definition above allows inquiry to provide a positive test: when the hypothesis is correct, inquiry will eventually indicate its correctness with certainty (given background knowledge). The specific definitions of “verifiability” offered by the logical empiricists are not equivalent to verifiability in the learning-theoretic sense. For example, strict verificationism holds that “in order to be meaningful a claim must be implied by a finite number of observation sentences”. No finite number of observation sentences is equivalent to the hypothesis \(H_2 =\) “some (observed) raven is not black”, because this hypothesis is equivalent to an infinite disjunction of observation sentences (i.e., a non-black raven at time 1, or a non-black raven at time 2, …).

Falsificationism is a well-known view in the philosophy of science. The core idea is that for a hypothesis to be scientific, rather than pseudo-scientific or metaphysical, it must be falsifiable in the following sense: “statements …, in order to be ranked as scientific, must be capable of conflicting with possible, or conceivable observations”. (Popper 1962, 39). The main difference with our development is the philosophical objective: the goal of learning theory is not to demarcate scientific hypotheses from pseudo-scientific theories, but to characterize the standards of empirical success we can expect from inquiry for a given set of hypotheses. A hypothesis that is refutable according to the definition above allows inquiry to provide a negative test: when the hypothesis is incorrect, inquiry will eventually indicate its incorrectness with certainty (given background knowledge). The specific definition of “falsifiability” in the Popper quote above is not equivalent to refutability in the learning-theoretic sense [Schulte and Juhl 1996]. For example, the hypothesis \(H =\) “the first raven is black and some other raven is non-black” conflicts with the possible observation that the first raven is white. However, if in fact all observed ravens are black, then \(H\) is incorrect but not falsified by any finite number of observations, hence not refutable according to the learning-theoretic definition. For further discussion of the relationship between Popperian falsification and learning theory see [Genin 2018].

3.2 Point-Set Topology and the Axioms of Verifiability

To further elucidate the learning-theoretic concepts of verifiability and refutability, we note that they satisfy the following fundamental properties. We give informal but rigorous proofs.

  1. A disjunction of verifiable hypotheses is also verifiable.
    Proof: Let \(H = H_1\) or \(H_2\) or … or \(H_n\) or … be a disjunction of verifiable hypotheses \(H_i\) (the disjunction may be infinite). Suppose that \(H\) is correct for a complete data sequence. Then some \(H_i\) is correct for the data sequence. Since \(H_i\) is verifiable, there is a finite number of observations that entails \(H_i\), which entails \(H\). So if \(H\) is correct for any complete data sequence, there is a finite number of observations from the sequence that entails \(H\), as required for verifiability. For example, let \(H_i\) be the verifiable hypothesis that there is a non-black raven at time \(i\). Then the hypothesis \(H =\) “some (observed) raven is not black” is equivalent to the disjunction \(H_1\) or \(H_2\) or … or \(H_n\) or …. Since each hypothesis \(H_i\) is verifiable, so is \(H\).
  2. A finite conjunction of verifiable hypotheses is also verifiable.
    Proof: Let \(H = H_1\) and \(H_2\) and … and \(H_n\) be a finite conjunction of verifiable hypotheses \(H_i\). Suppose that \(H\) is correct for a complete data sequence. Then each \(H_i\) is correct for the data sequence. Since \(H_i\) is verifiable, there is a finite number of observations that entails \(H_i\). Because there are only finitely many hypotheses \(H_i\), eventually each hypothesis will be verified by a finite number of observations, which together entail their conjunction \(H\). So if \(H\) is correct for any complete data sequence, there is a finite number of observations from the sequence that entails \(H\), as required for verifiability. For example, let \(H_1\) be the verifiable hypothesis that the first raven is non-black and let \(H_2\) be the verifiable hypothesis that the second raven is non-black. If the conjunction \(H = H_1\) and \(H_2\) is correct for a data sequence, then the first two ravens are not black. The observation of the first two ravens therefore entails \(H\).
  3. A tautology and a contradiction are (trivially) verifiable.
    Proof: A tautology (like “the first observed raven is black or is not black”) is correct for any data sequence and entailed by any evidence sequence. A contradiction (like “the first observed raven is black and is not black”) is trivially verifiable, because it is correct for no data sequence, so the condition for verifiability holds vacuously.
  4. A hypothesis is verifiable if and only if its negation is refutable.
    Proof: We consider the if direction; the converse is similar. Suppose that the negation not-\(H\) of a hypothesis is refutable. Consider any complete data sequence for which hypothesis \(H\) is correct. Then not-\(H\) is incorrect, and will be falsified by a finite number of observations, since it is refutable. This finite observation set entails \(H\). So if \(H\) is correct for any complete data sequence, there is a finite number of observations from the sequence that entails \(H\), as required for verifiability. For example, \(H =\) “some (observed) raven is not black” is the negation of the refutable hypothesis not-\(H =\) “all (observed) ravens are black”. If not-\(H\) is incorrect for a complete data sequence, it will eventually be falsified by the observation of a non-black raven. This observation entails \(H\).

Remarkably, the properties listed are exactly the fundamental axioms of an important branch of mathematics known as point-set topology [Abramsky 1987, Vickers 1986]. A topological space is defined by a collection of sets known as open sets or neighbourhoods, that satisfy the axiomatic properties of verifiable hypotheses (closure under arbitrary union and finite intersection; both the empty set and the entire space are open). The set-theoretic complements of open sets are called closed sets, so refutable hypotheses correspond exactly to closed sets. Point-set topology was invented to support a kind of generalized functional analysis without numbers (more precisely, without distances). It is striking that the foundational axioms of topology have an exact epistemological interpretation in terms of the properties of empirical hypotheses that allow verification or falsification with certainty. Current mathematical developments of learning theory often begin by taking as a basic concept a set of verifiable hypotheses satisfying the properties listed. This approach has two advantages.

  • Learning theory can draw on, and contribute to, the rich body of concepts and results from one of the most developed branches of modern mathematics [Kelly 1996, Baltag et al. 2015, de Brecht and Yamamoto 2008].
  • The flexibility to adapt the notion of evidence item to the context of an application makes it easier to apply the general theory in different domains. For example, consider the problem of obtaining increasingly precise measurements of a quantity of interest (e.g., the speed of light in physics). We can take the basic set of verifiable hypotheses to be (unions of) open intervals around the true value of the quantity [Baltag et al. 2015, Genin and Kelly 2017]. Another example is the concept of statistical verifiability covered in Section 6 below.

For the sake of concreteness, this entry describes examples where the basic verifiable hypotheses are disjunctions of finite sequences of evidence items. We will describe definitions and results in such a way that they assume only the axiomatic properties listed, so that they are easy to apply in other settings.

3.3 Identifiability in the Limit of Inquiry

A fundamental result describes the conditions under which a method can reliably find the correct hypothesis among a finite or countably infinite set \(\mathbf{H}\) of mutually exclusive hypotheses that jointly cover all possibilities consistent with the inquirer’s background assumptions. A learner for \(\mathbf{H}\) maps a finite sequence of observations to a hypothesis in \(\mathbf{H}\). For example, in the New Riddle of Induction, the natural projection rule is a learner for the hypothesis set \(\mathbf{H}\) that comprises “all emeralds are green”, \(H_1 =\) “all emeralds are grue(1)”, \(H_2 =\) “all emeralds are grue(2)”, etc., for all critical times \(t\). A learner reliably identifies, or simply identifies, a correct hypothesis from \(\mathbf{H}\) if for every complete data sequence the following holds: if \(H\) from \(\mathbf{H}\) is the correct hypothesis for the data sequence, then there is a finite number of observations after which the learner conjectures the correct hypothesis \(H\) on any further observations from the data sequence. The bold generalizing method and the natural projection rule are examples of reliable learners for their hypothesis sets.

Theorem. There is a learner that reliably identifies a correct hypothesis from \(\mathbf{H}\) if and only if each hypothesis \(H\) in \(\mathbf{H}\) is a finite or countable disjunction of refutable hypotheses.

For the proof see Kelly [1996, Ch. 3.3].
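
In the same notation as before, with \(H_\varepsilon\) the hypothesis in \(\mathbf{H}\) that is correct for the complete data sequence \(\varepsilon\), reliable identification amounts to:

\[
L \text{ identifies } \mathbf{H} \iff \text{for every } \varepsilon, \text{ there is an } n \text{ such that } L(\varepsilon|m) = H_\varepsilon \text{ for all } m \ge n.
\]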

Example. For illustration, let’s return to the ornithological example with two alternative hypotheses: (1) all but finitely many swans are white, and (2) all but finitely many swans are black. As we saw, it is possible in the long run to reliably settle which of these two hypotheses is correct. Hence by the characterization theorem, each of the two hypotheses must be a disjunction of refutable empirical claims. To see that this indeed is so, observe that “all but finitely many swans are white” is logically equivalent to the disjunction

at most 1 swan is black, or at most 2 swans are black, …, or at most \(n\) swans are black, or …,

and similarly for “all but finitely many swans are black”. Each of the claims in the disjunction is refutable. For example, take the claim that “at most 3 swans are black”. If this is false, more than 3 black swans will be found, at which point the claim is conclusively falsified. The figure below illustrates how the identifiable hypotheses are structured as disjunctions of refutable hypotheses.

two part diagram: link to extended description below

Figure 5 [An extended description of figure 5 is in a supplement.]

The characterization theorem implies that we can think of a reliable method as adopting internal strengthened versions of the original hypotheses under investigation that are refutable. As the example above shows, the theorem does not imply that the strengthened hypotheses are mutually exclusive (e.g., “at most 3 swans are black” is consistent with “at most 2 swans are black”). A more recent characterization theorem due to Baltag, Gierasimczuk, and Smets [2015] provides an alternative structural analysis in which identifiable hypotheses are decomposed into mutually exclusive components, as follows.

A hypothesis \(H\) is verirefutable if it is equivalent to the conjunction of a verifiable and a refutable hypothesis (given background knowledge): \(H = (V\) and \(R)\) where \(V\) is verifiable and \(R\) is refutable. For example, the hypothesis “exactly 2 swans are black” is verirefutable, since it is equivalent to the conjunction of the verifiable hypothesis “at least 2 swans are black” and the refutable hypothesis “at most 2 swans are black”. The term “verirefutable” is due to [Genin and Kelly 2015]; it signifies that when a verirefutable hypothesis is true, there is some initial condition after which the hypothesis is refutable, that is, the hypothesis will be falsified by the data if it is false. Baltag et al. refer to verirefutable hypotheses as locally closed. They establish the following characterization theorem for reliable learning [Baltag et al. 2015].

Theorem. There is a learner that reliably identifies a correct hypothesis from \(\mathbf{H}\) if and only if each hypothesis \(H\) in \(\mathbf{H}\) is equivalent to a finite or countable disjunction of mutually exclusive verirefutable hypotheses.

Since the verirefutable hypotheses are mutually exclusive, they constitute a valid refined hypothesis space whose members entail exactly one of the original hypotheses. The characterization theorem entails that, without loss of learning power, inductive methods can transform the original hypothesis space into a verirefutable one. The figure below illustrates the decomposition into verirefutable hypotheses.

two part diagram: link to extended description below

Figure 6 [An extended description of figure 6 is in a supplement.]

A few points will help explain the significance of characterization theorems.

  1. Structure of Reliable Methods. Characterization theorems tell us how the structure of reliable methods is attuned to the structure of the hypotheses under investigation. For example, the theorem mentioned establishes a connection between falsifiability and testability, but one that is more attenuated than the naïve Popperian envisions: it is not necessary that the hypotheses under test be directly falsifiable; rather, there must be ways of strengthening each hypothesis that yield a countable number of refutable “subhypotheses”. We can think of these refutable subhypotheses as different ways in which the main hypothesis may be true. (For example, one way in which “all but finitely many swans are white” is true is if there are at most 10 black swans; another is if there are at most 100 black swans, etc.). Strengthening the original hypotheses so they become empirically refutable matches the spirit of Lakatos’s methodology in which a general scientific paradigm is articulated with auxiliary hypotheses to define testable (i.e., falsifiable) claims.

  2. Import of Background Assumptions. The characterization result draws a line between the solvable and unsolvable problems. Background knowledge reduces the inductive complexity of a problem; with enough background knowledge, the problem crosses the threshold between the unsolvable and the solvable. In many domains of empirical inquiry, the pivotal background assumptions are those that make reliable inquiry feasible. (Kuhn [1970] makes related points about the importance of background assumptions embodied in a “paradigm”).

  3. Language Invariance. Learning-theoretic characterization theorems concern what Kelly calls the “temporal entanglement” of various observation sequences [Kelly 2000]. Ultimately they rest on entailment relations between given evidence, background assumptions and empirical claims. Since logical entailment does not depend on the language we use to frame evidence and hypotheses, the inductive complexity of an empirical problem as determined by the characterization theorems is language-invariant.

4. The Long Run in the Short Run: Reliable and Stable Beliefs

A longstanding criticism of convergence to the truth as an aim of inquiry is that, while fine in itself, this aim is consistent with any crazy behaviour in the short run [Salmon 1991]. For example, we saw in the New Riddle of Induction that a reliable projection rule can conjecture that the next emerald will be blue no matter how many green emeralds have been found—as long as eventually the rule projects “all emeralds are green”. One response is that if means-ends analysis takes into account other epistemic aims in addition to long-run convergence, then it can provide strong guidance for what to conjecture in the short run.

To illustrate this point, let us return to the Goodmanian Riddle of Induction. Ever since Plato, philosophers have considered the idea that stable true belief is better than unstable true belief, and epistemologists such as Sklar [1975] have advocated similar principles of “epistemic conservatism”. Kuhn tells us that a major reason for conservatism in paradigm debates is the cost of changing scientific beliefs [Kuhn 1970]. In this spirit, learning theorists have examined methods that minimize the number of times that they change their theories before settling on their final conjecture [Putnam 1965; Kelly 1996; Jain et al. 1999]. Such methods are said to minimize mind changes.

4.1 Example: The New Riddle of Induction

The New Riddle of Induction turns out to be a nice illustration of this idea. Consider the natural projection rule (conjecture that all emeralds are green on a sample of green emeralds). If all emeralds are green, this rule never changes its conjecture. And if all emeralds are grue\((t)\) for some critical time \(t\), then the natural projection rule abandons its conjecture “all emeralds are green” at time \(t\)—one mind change—and thereafter correctly projects “all emeralds are grue\((t)\)”. Remarkably, rules that project grue rather than green do not do as well. For example, consider a rule that conjectures that all emeralds are grue(3) after observing one green emerald. If two more green emeralds are observed, the rule’s conjecture is falsified and it must eventually change its mind, say to conjecture that all emeralds are green (supposing that green emeralds continue to be found). But then at that point, a blue emerald may appear, forcing a second mind change. This argument can be generalized to show that the aim of minimizing mind changes allows only the green predicate to be projected on a sample of all green emeralds [Schulte 1999]. As we saw, the natural projection rule changes its mind at most once; the figure below illustrates in a typical case how an unnatural projection rule may have to change its mind twice or more, and a short simulation sketch follows the figure.

rule example: link to extended description below

Figure 7 [An extended description of figure 7 is in a supplement.]
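
The counting can be simulated directly. The sketch below reuses the illustrative string encoding from earlier sections; the unnatural rule is the grue(3)-first rule just described, with its fallback behaviour an assumption for illustration:

```python
def mind_changes(rule, stream):
    """Count how often a rule's conjecture changes along a data stream."""
    changes, last = 0, None
    for n in range(1, len(stream) + 1):
        conjecture = rule(stream[:n])
        if last is not None and conjecture != last:
            changes += 1
        last = conjecture
    return changes

def natural(evidence):
    if "blue" in evidence:
        return f"all emeralds are grue({evidence.index('blue') + 1})"
    return "all emeralds are green"

def unnatural(evidence):
    """Project grue(3) first; fall back on the natural rule once refuted."""
    if "blue" not in evidence and len(evidence) >= 3:
        return "all emeralds are green"      # grue(3) was falsified at time 3
    if "blue" in evidence:
        return f"all emeralds are grue({evidence.index('blue') + 1})"
    return "all emeralds are grue(3)"

stream = ["green"] * 4 + ["blue"] * 2           # a grue(5) world
print(mind_changes(natural, stream))             # 1
print(mind_changes(unnatural, stream))           # 2
```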

4.2 More Examples

The same reasoning applies to the question of whether all ravens are black. The bold generalizer that conjectures that all ravens are black after observing samples of only black ravens succeeds with at most one mind change: if indeed all ravens are black, the generalizer never changes its mind at all. And if there is a nonblack raven, the refutation occasions one mind change, but afterwards the question is settled.

Contrast this with the contrary method that asserts that there is a nonblack raven after observing a sample of all black ones. If only black ravens continue to be observed, the contrary method has to eventually change its mind and assert that “all ravens are black”, or else it fails to arrive at the correct generalization. But then at that point, a nonblack raven may appear, forcing a second mind change. Thus the goal of stable belief places strong constraints on what a method may conjecture in the short run for this problem: on observing only black ravens, the options are “all ravens are black” or “no opinion yet”, but not “there is a nonblack raven”.

In the conservation law problem, the restrictive method described in Section 2.1 is the only method that minimizes mind changes. Recall that the restrictive method adopts a set of conservation laws that rule out as many unobserved reactions as possible. It can be shown that if there are \(n\) known elementary particles whose reactions are observed, this method requires at most \(n\) mind changes. (The number of elementary particles in the Standard Model is around \(n = 200\).)

For learning causal graphs, the following variant of the method described in Section 2.2 minimizes the number of mind changes.

  • Suppose we have observed a set of correlations or associations among a set of variables of interest.
  • If there is a unique causal graph that explains the observed correlations with a minimum number of direct causal links, select this graph.
  • If there is more than one causal graph that explains the observed correlations with a minimum number of direct causal links, output “no opinion yet” (or conjecture the disjunction of the minimum edge graphs). A toy sketch of this selection rule follows below.
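
The following toy sketch illustrates the tie-aware selection. It deliberately simplifies “explains the observed correlations” to mere path-connectivity between correlated variables (the actual criterion in [Schulte, Luo and Greiner 2007] is graph-theoretic; the variable names and data are hypothetical):

```python
from itertools import combinations

def connected(u, v, edges):
    """Is there a path between u and v using the given undirected edges?"""
    seen, stack = {u}, [u]
    while stack:
        node = stack.pop()
        if node == v:
            return True
        for a, b in edges:
            if node in (a, b):
                nbr = b if node == a else a
                if nbr not in seen:
                    seen.add(nbr)
                    stack.append(nbr)
    return False

def conjecture(variables, correlations):
    """Select the unique minimum-edge graph explaining all correlations,
    or withhold judgment when several minimal graphs tie."""
    all_edges = list(combinations(sorted(variables), 2))
    for size in range(len(all_edges) + 1):
        winners = [g for g in combinations(all_edges, size)
                   if all(connected(u, v, g) for u, v in correlations)]
        if winners:
            return winners[0] if len(winners) == 1 else "no opinion yet"

print(conjecture({"X", "Y", "Z"}, [("X", "Y")]))
# (('X', 'Y'),)   -- a unique minimal graph
print(conjecture({"X", "Y", "Z"}, [("X", "Y"), ("Y", "Z"), ("X", "Z")]))
# 'no opinion yet' -- three two-edge graphs tie
```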

This example illustrates that sometimes minimizing mind changes requires withholding beliefs. Intuitively, this occurs when there are two or more equally simple explanations of the data, and the inquirer has to wait until further observations decide between these possibilities. Jumping to one of the simple conclusions might lead to an unnecessary mind change in case an alternative equally simple explanation turns out to be correct. In such cases there is a trade-off between the goals of achieving stable belief, on the one hand, and quickly settling on a true belief on the other [Schulte 1999]. We discuss the connection between simplicity and stable belief in the next section on simplicity.

4.3 Regressive Mind Changes

Genin and Kelly [2015] refine the mind change approach by distinguishing different kinds of mind changes.

  • Abandoning a true hypothesis in favor of a false one. This is an undesirable regressive mind change.
  • Abandoning a false hypothesis in favor of a true one. This is a desirable progressive mind change.
  • Abandoning a false hypothesis in favor of another false one.

The table below illustrates these distinctions in the New Riddle of Induction and the Raven example. Genin and Kelly investigate the principle that inductive methods should minimize the number of regressive mind changes, that is, the number of times new evidence leads the method to abandon a true hypothesis in favor of a false one. The notion that regressive mind changes are a mark of epistemic failure matches a long tradition in epistemology. Defeasibility theories of knowledge (see the link in the Other Internet Resources section below) hold that in order for an agent’s true belief to count as knowledge, it must be indefeasible in the sense that accepting further true propositions should not lead the agent to abandon her belief. Translated into the language of mind changes, this means that an inquirer’s true current conjecture can count as knowledge only if there is no further evidence that would lead her to change her mind and adopt an alternative false conjecture. Plato’s Meno conveys this point vividly.

Now this is an illustration of the nature of true opinions: while they abide with us they are beautiful and fruitful, but they run away out of the human soul, and do not remain long, and therefore they are not of much value …. But when they are bound, in the first place, they have the nature of knowledge; and, in the second place, they are abiding.

|  | New Riddle of Induction | Raven problem |
| --- | --- | --- |
| True hypothesis | “All emeralds are green” | “There is a non-black raven” |
| Conjecture 0 | “All emeralds are grue(1)” | “There is a non-black raven” |
| Observation 1 | Green emerald | Black raven |
| Conjecture 1 | “All emeralds are grue(2)” (false-to-false) | “All ravens are black” (true-to-false) |
| Observation 2 | Green emerald | White raven |
| Conjecture 2 | “All emeralds are green” (false-to-true) | “There is a non-black raven” (false-to-true) |

Illustrating regressive and progressive mind changes

While minimizing regressive mind changes is an even more important epistemic goal than avoiding mind changes in general, it leads to weaker strictures on inductive learning. At the same time, any strictures that do follow from it carry even more normative force. The table above illustrates the differences between the two principles in the New Riddle of Induction and the Raven problem. In the New Riddle of Induction, if only green emeralds are ever observed, a projection rule may keep projecting any number of gruesome predicates without producing a regressive mind change: it simply abandons one false gruesome predicate for another false gruesome predicate. Therefore even unnatural projection rules incur 0 regressive mind changes, provided they never abandon the all-green hypothesis once adopted.

The consequences of minimizing regressive mind changes are different for the question of whether all ravens are black. Consider again the contrary method that asserts that there is a nonblack raven after observing a sample of black ones. As shown in the Table and discussed above, the contrary method has to eventually change its hypothesis after seeing more black ravens to conjecture that all ravens are black, and then, upon observing a white raven, return to its true initial hypothesis that there is a nonblack raven. Thus the contrary method undergoes at least one regressive mind change in the worst case. On the other hand, the generalizing method that asserts that all ravens are black after observing a sample of black ones changes its conjecture only when a nonblack raven is observed, a progressive mind change from a false hypothesis to a true one. Therefore the principle of avoiding regressive mind changes singles out the generalizing method over the contrary one.

As the example illustrates, regressive mind changes are associated with cycles of conjectures. This is because a reliable method must eventually return to a true hypothesis after adopting a false one, so a regressive mind change leads to at least one cycle of conjectures: true, then false, then true again. Methods that avoid regressive mind changes are therefore studied under the heading of cycle-free learning [Genin and Kelly 2015] or minimizing U-turns [Carlucci et al. 2005]. Genin and Kelly [2015, 2019] provide a general result that elucidates the methodological import of avoiding regressive mind changes and cycles of conjectures (described in Section 5.4). Their result belongs to a family of theorems that establish a striking connection between avoiding mind changes and Ockham’s razor, which we discuss in the next section.

5. Simplicity, Stable Belief, and Ockham’s Razor

A strong intuition about inductive inference and scientific method is that we should prefer simpler hypotheses over complex ones; see the entry on simplicity. Statisticians, computer scientists, and other researchers concerned with learning from observations have made extensive use of a preference for simplicity to solve practical inductive problems [Domingos 1999]. From a foundational point of view, simplicity is problematic for at least two reasons.

  1. The justification problem: Why adopt simple hypotheses? One obvious answer is that the world is simple and therefore a complex theory is false. However, the a priori claim that the world is simple is highly controversial—see the entry on simplicity. From a learning-theoretic perspective, dismissing complex hypotheses impairs the reliability of inductive methods. In Kelly’s metaphor, a fixed bias is like a stopped watch: We may happen to use the watch when it is pointing at the right time, but the watch is not a reliable instrument for telling time [Kelly 2007a, 2010].

  2. The description problem: Epistemologists have worried that simplicity is not an objective feature of a hypothesis, but rather “depends on the mode of presentation”, as Nozick puts it. Goodman’s Riddle illustrates this point. If generalizations are framed in blue-green terms, “all emeralds are green” appears simpler than “all emeralds are first green and then blue”. But in a grue-bleen language, “all emeralds are grue” appears simpler than “all emeralds are first grue and then bleen”.

Learning theorists have made recent and ongoing efforts to apply means-ends epistemology to develop a theory of the connection between simplicity and induction that addresses these concerns [Kelly 2010, Harman and Kulkarni 2007, Luo and Schulte 2006, Steel 2009]. It turns out that a fruitful perspective is to examine the relationship between the structure of a hypothesis space and the mind change complexity of the corresponding inductive problem. The fundamental idea is that, while simplicity does not enjoy an a priori connection with truth, choosing simple hypotheses can help an inquirer find the truth more efficiently, in the sense of avoiding mind changes. Kelly’s road metaphor illustrates the idea. Consider two routes to the destination, one via a straight highway, the other via back roads. Both routes eventually lead to the same point, but the back roads entail more twists and turns [Kelly 2007a, 2010].

A formalization of this idea takes the form of an Ockham Theorem: A theorem that shows (under appropriate restrictions) that an inductive method finds the truth as efficiently as possible for a given problem if and only if the method is the Ockham method, that is, it selects the simplest hypothesis consistent with the data. An Ockham theorem provides a justification for Ockham’s inductive razor as a means towards epistemic aims.

Whether an Ockham theorem is true depends on the description of the Ockham method, that is, on the exact definition of simplicity for a set of hypotheses. There is a body of mathematical results that establish Ockham theorems using a language-invariant simplicity measure, which we explain next.

5.1 Defining Simplicity

Say that a hypothesis \(H\) from a background set of possible hypotheses \(\mathbf{H}\) is verifiable if there is an evidence sequence such that \(H\) is the only hypothesis from \(\mathbf{H}\) that is consistent with the evidence sequence. For example, in the black raven problem above, the hypothesis “there is a nonblack raven” is verifiable since it is entailed by an observation of a nonblack raven. The hypothesis “all ravens are black” is not verifiable, since it is not entailed by any finite evidence sequence. The following procedure assigns a simplicity rank to each hypothesis \(H\) from a set of hypotheses \(\mathbf{H}\) [Apsitis 1994, Luo and Schulte 2006].

  1. Assign all verifiable hypotheses simplicity rank 0.
  2. Remove the verifiable hypotheses from the hypothesis space to form a new hypothesis space \(\mathbf{H}_1.\)
  3. Assign simplicity rank 1 to the hypotheses that are verifiable given \(\mathbf{H}_1.\)
  4. Remove the newly verifiable hypotheses with simplicity rank 1 from the hypothesis space to form a new hypothesis space \(\mathbf{H}_2.\)
  5. Continue removing hypotheses until no new hypotheses are verifiable given the current hypothesis space.
  6. The simplicity rank of each hypothesis \(H\) is the first stage at which it is removed by this procedure. In other words, it is the index of the first restricted hypothesis space that makes \(H\) verifiable.
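To make the procedure concrete, here is a small sketch for the Riddle of Induction. It is an illustrative toy, not from the source: since the space of evidence sequences is infinite, the relevant verifiability facts are encoded directly rather than computed from observations.

```python
def verifiable(h, space):
    """Verifiability facts for the Riddle, encoded directly: a sequence of
    t green emeralds followed by a blue one verifies grue(t) against every
    rival, whereas an all-green sequence of any finite length n is still
    consistent with grue(t) for t > n."""
    if h.startswith("grue"):
        return True
    # "green" is verifiable only once all grue rivals are eliminated.
    return not any(h2.startswith("grue") for h2 in space if h2 != h)

def simplicity_ranks(space):
    """Iterated elimination of verifiable hypotheses."""
    ranks, remaining, stage = {}, set(space), 0
    while remaining:
        newly = {h for h in remaining if verifiable(h, remaining)}
        if not newly:
            break                     # nothing verifiable: procedure halts
        for h in newly:
            ranks[h] = stage          # rank = stage at which h is removed
        remaining -= newly
        stage += 1
    return ranks

space = ["green"] + [f"grue({t})" for t in range(1, 5)]
print(simplicity_ranks(space))
# -> every grue(t) receives rank 0; "green" receives rank 1.
```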

Hypotheses with higher simplicity rank are regarded as simpler than those with lower ranks. Simplicity ranks are defined in terms of logical entailment relations, hence are language-invariant. Simplicity ranks as defined can be seen as degrees of falsifiability in the following sense. Consider a hypothesis of simplicity rank 1. Such a hypothesis is falsifiable because an evidence sequence that verifies an alternative hypothesis of rank 0 falsifies it. Moreover, a hypothesis of simplicity rank 1 is persistently falsifiable in the sense that it remains falsifiable no matter what evidence sequence consistent with it is observed. A hypothesis of simplicity rank \(n+1\) is persistently falsifiable by hypotheses of rank \(n.\) Let us illustrate the definition in our running examples.

5.2 Examples

  • In the Riddle of Induction, the verifiable hypotheses are the grue hypotheses with critical time \(t\): any sequence of \(t\) green emeralds followed by blue ones entails the corresponding grue\((t)\) generalization. Thus the grue hypotheses receive simplicity rank 0. After the grue hypotheses are eliminated, the only remaining hypothesis is “all emeralds are green”. Given that it is the only possibility in the restricted hypothesis space, “all emeralds are green” is entailed by any sequence of green emeralds. Therefore “all emeralds are green” has simplicity rank 1. After removing the all green hypothesis, no hypotheses remain.

  • In the raven color problem, the verifiable hypothesis is “a nonblack raven will be observed”, which receives simplicity rank 0. After removing the hypothesis that a nonblack raven will be observed, the only remaining possibility is that only black ravens will be observed, hence this hypothesis is verifiable in the restricted hypothesis space and receives simplicity rank 1.

  • The simplicity rank of a causal graph is given by the number of direct links not contained in the graph. Therefore the fewer direct links are posited by the causal model, the higher its simplicity rank.

  • The simplicity rank of a set of conservation laws is given by the number of independent laws. (Independence in the sense of linear algebra.) Therefore the more nonredundant laws are introduced by a theory, the higher its simplicity rank. Each law rules out some reactions, so maximizing the number of independent laws given the observed reactions is equivalent to ruling out as many unobserved reactions as possible.

5.3 Stable Belief and Simplicity: An Ockham Theorem

The following theorem shows the connection between the mind-change complexity of an inductive problem and the simplicity ranking as defined.

Theorem. Let \(\mathbf{H}\) be a set of empirical hypotheses. Then there is a method that reliably identifies a correct hypothesis from \(\mathbf{H}\) in the limit with at most \(n\) mind changes if and only if the elimination procedure defined above terminates with an empty set of hypotheses after \(n\) stages.

Thus for an inductive problem to be solvable with at most \(n\) mind changes, the maximum simplicity rank of any possible hypothesis is \(n.\) In the Riddle of Induction, the maximum simplicity rank is 1, and therefore this problem can be solved with at most 1 mind change. The next result provides an Ockham theorem connecting simplicity and mind change performance.

Ockham Theorem. Let \(\mathbf{H}\) be a set of empirical hypotheses with optimal mind change bound \(n\). Then an inductive method is mind change optimal if and only if it satisfies the following conditions.

  1. Whenever the method adopts one of the hypotheses from \(\mathbf{H},\) this hypothesis is the uniquely simplest one consistent with the evidence.
  2. If the method changes its mind at inquiry time \(t+1\), the uniquely simplest hypothesis at time \(t\) is falsified at time \(t+1.\)

This theorem says that a mind-change optimal method may withhold a conjecture as a skeptic would, but if it does adopt a definite hypothesis, the hypothesis must be the simplest one, in the sense of having the maximum simplicity rank. Thus the mind change optimal methods discussed in Section 4 are all Ockham methods that adopt the simplest hypothesis consistent with the data. The Ockham theorem shows a remarkable reversal from the long-standing objection that long-run reliability imposes too few constraints on short-run conjectures: If we add to long-run convergence to the truth the goal of achieving stable belief, then in fact there is a unique inductive method that achieves this goal in a given empirical problem. Thus the methodological analysis switches from offering no short-run prescriptions to offering a complete prescription.

5.4 Regressive Mind Changes and Simplicity: Another Ockham Theorem

The previous subsection defines a complete simplicity ranking for every hypothesis under investigation. This means that any hypothesis can be compared to another as simpler or equally simple. A less demanding concept is a partial order, which allows that some hypotheses may simply not be comparable, like apples and oranges. Genin and Kelly [2015] show that the following partial order leads to an Ockham principle for avoiding regressive mind changes (see Section 4.3).

  • An observation sequence separates hypothesis \(H_1\) from hypothesis \(H_2\) if the observations are consistent with \(H_1\) and falsify \(H_2\) (given background knowledge).
  • Hypothesis \(H_1\) is inseparable from \(H_2\), written \(H_1 \lt H_2\), if no observation sequence separates \(H_1\) from \(H_2\). Equivalently, \(H_1 \lt H_2\) if and only if any evidence consistent with \(H_1\) is also consistent with \(H_2.\)

The separation terminology is due to Smets et al., who relate it to separation principles in point-set topology. In terms of the epistemological interpretation of point-set topology from Section 3.2, we have \(H_1 \lt H_2\) if and only if every complete data sequence for \(H_1\) is a boundary point for the data sequences of \(H_2.\) In an epistemologically resonant phrase, Genin and Kelly say that hypothesis \(H_1\) “faces the problem of induction” with respect to \(H_2\) whenever \(H_1 \lt H_2\). This is because whenever \(H_1\) is correct, a reliable learner will have to take an “inductive leap” and conjecture \(H_1\) although any finite amount of evidence is also consistent with \(H_2\).

Examples

  • In the raven problem, \(H_1 =\) “all ravens are black” \(\lt H_2 =\) “some raven is not black”. But it is not the case that “some raven is not black” \(\lt\) “all ravens are black” because the observation of a white raven separates \(H_2\) from \(H_1.\)
  • In causal graph learning, if graph \(G_1\) contains a subset of edges (direct causal links) of those in alternative graph \(G_2\), then \(G_1 \lt G_2\). This is because any correlations that can be explained by \(G_1\) can also be explained by the larger graph \(G_2\).
  • In curve fitting, \(L \lt Q\) where \(L\) is the set of linear functions, and \(Q\) is the set of quadratic functions. This is because any set of points that can be fit by a linear function can also be fit by a quadratic function (see the sketch after this list).
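Here is a minimal sketch of the separation test for the curve-fitting example; the data points, the tolerance, and the use of np.polyfit as an exact-fit check are illustrative assumptions.

```python
import numpy as np

def fits_exactly(points, degree, tol=1e-8):
    """Can some polynomial of the given degree pass through all points?
    A zero least-squares residual means an exact fit exists."""
    xs, ys = zip(*points)
    coeffs = np.polyfit(xs, ys, degree)
    return bool(np.allclose(np.polyval(coeffs, xs), ys, atol=tol))

linear_data = [(0, 0), (1, 1), (2, 2)]      # consistent with L, hence with Q
quadratic_data = [(0, 0), (1, 1), (2, 4)]   # separates Q from L

print(fits_exactly(linear_data, 1), fits_exactly(linear_data, 2))       # True True
print(fits_exactly(quadratic_data, 1), fits_exactly(quadratic_data, 2)) # False True
```

The asymmetry is just what the definition requires: every data set fit exactly by a line is fit by some quadratic, while the quadratic data above falsify every linear hypothesis.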

These examples suggest that the \(\lt\) partial order corresponds to our intuitive simplicity judgements about empirical hypotheses; Genin and Kelly [2019] provide an extensive defense of this claim. It can be shown that the \(\lt\) ordering agrees with the simplicity ranks defined in the previous subsection, in the sense that if \(H_1 \lt H_2\) but not \(H_2 \lt H_1\), then the simplicity rank of \(H_1\) is greater than the rank of \(H_2\) (recall that on this convention, higher ranks correspond to simpler hypotheses). These observations motivate an Ockham principle: An inductive method satisfies the Ockham principle with respect to separability if it always conjectures a maximally simple hypothesis \(H\) consistent with the evidence. In our notation, if an Ockham method adopts a hypothesis \(H\) given a finite observation sequence, then there is no alternative simpler hypothesis \(H'\) such that \(H' \lt H\). That is, every alternative hypothesis \(H'\) will eventually be separated from \(H\) by the evidence if \(H'\) is true. In the raven example, the generalizing method satisfies the Ockham principle, but the contrary method does not, because it adopts \(H_2 =\) “some raven is not black”. The following theorem shows that the connection between the Ockham principle and regressive mind changes is general.

Theorem. If an inductive method avoids conjecture cycles (and hence regressive mind changes), it satisfies the Ockham principle with respect to separability.

For a proof see Genin and Kelly [2015; Theorem 10]. Genin and Kelly also provide sufficient conditions for avoiding conjecture cycles.

While the results in this section establish a fruitful connection between simplicity and mind-change optimality, a limitation of the approach is that it requires that some hypotheses must be conclusively entailed or falsified by some evidence sequence. This is typically not the case for statistical models, where the probability that a hypothesis assigns to the evidence may become arbitrarily small but usually not 0. For instance, consider a coin flip problem and the hypothesis “the probability of heads is 90%”. If we observe one million tails, the hypothesis assigns that outcome a minuscule probability, but not 0, because any number of tails is logically consistent with a high probability of heads. The next section discusses how a reliabilist approach can be adapted to statistical hypotheses.

6. Reliable Learning for Statistical Hypotheses

Statistical hypotheses are the most common kind in practical data-driven decision making, for example in the sciences and in engineering. It is therefore important for a philosophical framework of inductive inference to include statistical hypotheses. There are two key differences between statistical hypotheses and the hypothesis sets we have considered so far [Sober 2015].

  • The relationship between observations and a hypothesis is probabilistic, not deductive: A statistical hypothesis assigns a probability to an observation sequence, typically strictly between 0 and 1. A deductive hypothesis is either consistent with an observation sequence or falsified by it.
  • The analysis of statistical hypotheses typically assumes that observations form a random sample: successive observations are independent of each other and follow the same distribution. It is possible to analyze statistical methods where later observations depend on earlier ones, but the mathematical complexity of the inductive methodology is much greater than with independent data.

Because of these properties, learning theory for nonstatistical hypotheses provides a more straightforward framework than statistics for traditional philosophical discussions in epistemology, inductive inference, and the philosophy of science. For example, epistemological discussions of justified true belief concern a deductive concept of belief where the inquirer accepts a proposition rather than assigning a probability to data. Scientific theories typically make deterministic predictions of future data from past observations (initial conditions), so an independence requirement makes it more difficult to apply a methodological framework to understanding scientific inquiry (see our case studies).

Normative means-ends epistemology can be applied to statistical hypotheses as well as deductive ones. In particular we will discuss how the ideas of reliable convergence to the truth and minimizing regressive mind changes can be adapted to the statistical setting. The key idea is to shift the unit of analysis: Whereas previously we considered the behavior of an inductive method for a specific data sequence, in statistical analysis we consider its aggregate behavior over a set of data sequences of the same length. In particular, we consider the probability that a method conjectures a hypothesis \(H\) for a given number of observations \(n.\)

Preliminaries on Statistical Hypotheses

We will illustrate the main ideas with a classic simple example, observing coin flips, and indicate how they can be generalized to more complex hypotheses. For more details please see [Genin and Kelly 2017, Genin 2018]. Suppose that an investigator has a question about the unknown bias \(p\) of a coin, where \(p\) represents the chance that a single flip comes out “Heads”. Different possible hypotheses correspond to different ranges of the bias \(p\), that is, to a partition of \([0,1]\). Let us say that the investigator asks a question about a simple point hypothesis: is the coin fair? Then we have

  • \(H_1 =\) “\(p = 0.5\)”
  • \(H_2 =\) “it is not the case that \(p = 0.5\)”. That is, either \(p \lt 0.5\) or \(p \gt 0.5.\)

Extending our previous terminology, we shall say that a true bias value \(p\) is correct for a hypothesis \(H\) if it lies within the set specified by \(H\). In our example, a bias value \(p\) is correct for \(H_1\) if and only if \(p = 0.5\); otherwise \(p\) is correct for \(H_2\). Given a true bias value \(p\), and assuming independence, we can compute a probability for any finite sequence of observations. This probability is known as the sample distribution. For example, for a fair coin with \(p = 0.5,\) the probability of observing 3 heads is \(0.5 \times 0.5 \times 0.5 = 0.125.\) If the chance of heads is 0.7, the probability of observing 3 heads is \(0.7 \times 0.7 \times 0.7 = 0.343.\) Notice how the independence assumption allows us to compute the probability of a sequence of observations as the product of single-observation probabilities. Without the independence assumption, we cannot infer the probability of multiple observations from the probability of a single observation, and the sample distribution is not defined.

As usual in this entry, an inductive method conjectures a hypothesis after observing a finite sequence of observations. A method that conjectures a statistical hypothesis is called a statistical test (see the link in the Other Internet Resources section below). The statistical literature provides an extensive collection of computationally efficient statistical tests for different types of statistical hypotheses. In the following discussion we consider the general learning performance of such methods, with respect to reliable convergence to a true hypothesis and avoiding mind changes. Consider a fixed observation length \(n,\) called the sample size. For sample size \(n\), there is a set of samples of length \(n\) such that the method conjectures hypothesis \(H\) given the sample. For example, for \(n = 3\), the method might conjecture \(H_2 =\) “the coin is not fair” after observing 3 heads. The aggregate probability that the method outputs hypothesis \(H\) given some sample of length \(n\) is the sum of the sample probabilities of the samples for which the method conjectures \(H\). In the supplement we give example computations of the aggregate probability. Because this aggregate probability is the key quantity for the methodology of statistical hypotheses, we introduce the following notation for it.

\(P_{n,p}(H) =\) the probability that a given inductive method conjectures hypothesis \(H\) after \(n\) observations, given that the true probability of a single observation is \(p\)
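A brute-force sketch of this computation for the coin example follows; the decision rule and its 0.1 threshold are hypothetical, chosen only to make the aggregation concrete.

```python
from itertools import product

def sample_prob(sample, p):
    """Sample distribution: probability of a heads/tails sequence,
    computed as a product thanks to the independence assumption."""
    prob = 1.0
    for flip in sample:
        prob *= p if flip == "H" else (1 - p)
    return prob

def aggregate_prob(method, hypothesis, n, p):
    """P_{n,p}(H): summed probability of the length-n samples on which
    the method conjectures H."""
    return sum(sample_prob(s, p)
               for s in product("HT", repeat=n)
               if method(s) == hypothesis)

def method(sample):
    """Hypothetical test: conjecture 'fair' when the observed frequency
    of heads is within 0.1 of one half."""
    freq = sample.count("H") / len(sample)
    return "fair" if abs(freq - 0.5) <= 0.1 else "not fair"

print(aggregate_prob(method, "fair", 10, 0.5))  # chance of the true verdict
print(aggregate_prob(method, "fair", 10, 0.7))  # chance of a false verdict
```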

In nonstatistical learning, we required a reliable method to eventually settle on a true hypothesis after sufficiently many observations. The statistical version of this criterion is that after sufficiently many observations, the chance of conjecturing the true hypothesis should approach 100%. More technically, say that a method identifies a true statistical hypothesis in chance if for every bias value \(p\), and for every threshold \(0\lt t \lt 1\), there is a sample size \(n\) such that for all larger sample sizes, the method conjectures the hypothesis \(H\) that is true for \(p\) with probability greater than \(t\). In symbols, we have \(P_{n',p}(H) \gt t\) for all sample sizes \(n' \gt n\), where \(H\) is the hypothesis that is true for \(p\). The figure below illustrates how the chance of conjecturing the true hypothesis increases with sample size, whereas the chance of conjecturing a false hypothesis decreases with sample size. The definition can be generalized to more complex statistical hypotheses by replacing the true bias value \(p\) with a list of parameters.

Figure 8 [An extended description of figure 8 is in a supplement.]

The notion of limiting identification in chance is similar to the concept of limiting convergence to a probability estimate in Reichenbach’s pragmatic vindication. Translated to our example, Reichenbach considered inductive rules that output an estimate of the true bias value \(p,\) and required that such a rule converges to the true value, in the sense that for every bias value \(p\), and for every threshold \(0\lt t \lt 1\), there is a sample size \(n\) such that for all larger sample sizes, with probability 1 the rule outputs an estimate that differs from the true value \(p\) by at most \(t\). In statistics, a method is called consistent if with increasing sample size, the method’s chance of conjecturing a correct answer converges to 100% (see the link in the Other Internet Resources section below). The terminology is unfortunate in that it suggests to philosophical readers a connection with the consistency of a formal proof system. In fact, the statistical concept of consistency has nothing to do with deductive logic; rather, it is a probabilistic analogue of the notion of identification in the limit of inquiry that is the main subject of this entry.

Genin and Kelly give a characterization theorem that provides necessary and sufficient conditions for a set of statistical hypotheses to be identifiable in chance, analogous to the structural conditions we discussed in Section 3.3 [2017; Theorem 4.3]. Genin [2018] discusses a statistical analogue of the requirement of minimizing mind changes. Recall that a regressive mind change occurs when an inquirer abandons a true hypothesis in favor of a false one (Section 4.3). The probabilistic analogue is a chance reversal, which occurs when the chance of conjecturing a true hypothesis decreases as the sample size increases. For instance, consider the question of whether a vaccine is effective for an infectious disease. Suppose the vaccine manufacturer runs a trial with 1000 patients and has designed a statistical method that has a chance of 90% of correctly indicating that the vaccine is effective when that is indeed the case. Now another trial is run with 1500 patients using the same statistical method. A chance reversal would occur if the method’s chance of correctly indicating that the vaccine is effective drops to 80%. As this example illustrates, a chance reversal corresponds to a failure to replicate a true result. A chance reversal is illustrated in the figure above, where the chance of conjecturing the true hypothesis drops as the sample size increases from 2 to 3. Although chance reversals are clearly undesirable, they are difficult to avoid, and in fact commonly used statistical methods are liable to such reversals [Genin 2018]. A more feasible goal is to bound the reversals by a threshold \(t\), such that if the chance of conjecturing the truth does decrease with increasing sample size, it decreases by at most \(t.\) (In symbols, \(P_{n,p}(H) - P_{n+1,p}(H) \lt t\) for all sample sizes \(n\) and true bias values \(p\), where \(H\) is the hypothesis correct for \(p\).) Genin [2018] shows that bounded chance reversals are feasible in many situations, and provides an Ockham theorem that elucidates the constraints that bounding chance reversals imposes on statistical hypothesis learning.
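Reusing the sketch above, we can watch chance reversals arise for the hypothetical threshold test. The reversals here stem from the discreteness of small samples; this is an illustration, not Genin’s construction.

```python
# Chance of conjecturing the true hypothesis "fair" at p = 0.5,
# for sample sizes 1 through 8.
chances = [aggregate_prob(method, "fair", n, 0.5) for n in range(1, 9)]
print([round(c, 3) for c in chances])
# Reversals: sample sizes at which the chance of the truth drops,
# e.g. the drop from sample size 2 to sample size 3.
print([n + 1 for n in range(len(chances) - 1) if chances[n + 1] < chances[n]])
```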

7. Other Approaches: Categorical vs. Hypothetical Imperatives

Kant distinguished between categorical imperatives that one ought to follow regardless of one’s personal aim and circumstances, and hypothetical imperatives that direct us to employ our means towards our chosen end. One way to think of learning theory is as the study of hypothetical imperatives for empirical inquiry. Many epistemologists have proposed various categorical imperatives for inductive inquiry, for example in the form of an “inductive logic” or norms of “epistemic rationality”. In principle, there are three possible relationships between hypothetical and categorical imperatives for empirical inquiry.

1. The categorical imperative will lead an inquirer to attain his cognitive goals. In that case means-ends analysis vindicates the categorical imperative. For example, when faced with a simple universal generalization such as “all ravens are black”, we saw above that following the Popperian recipe of adopting the falsifiable generalization and sticking to it until a counterexample appears leads to a reliable method.

2. The categorical imperative may prevent an inquirer from achieving his aims. In that case the categorical imperative restricts the scope of inquiry. For example, in the case of the two alternative generalizations with exceptions, the principle of maintaining a universal generalization until it is falsified leads to an unreliable method (cf. [Kelly 1996, Ch. 9.4]).

3. Some methods meet both the categorical imperative and the goals of inquiry, and others don’t. Then we may take the best of both worlds and choose those methods that attain the goals of inquiry and satisfy categorical imperatives. (See the further discussion in this section.)

For a proposed norm of inquiry, we can apply means-ends analysis to ask whether the norm helps or hinders the aims of inquiry. This was the spirit of Putnam’s critique of Carnap’s confirmation functions [Putnam 1963]: the thrust of his essay was that Carnap’s methods were not as reliable in detecting general patterns as other methods would be. More recently, learning theorists have investigated the power of Bayesian conditioning (see the entry on Bayesian epistemology). John Earman has conjectured that if there is any reliable method for a given problem, then there is a reliable method that proceeds by Bayesian updating [Earman 1992, Ch.9, Sec.6]. Cory Juhl [1997] provided a partial confirmation of Earman’s conjecture: He proved that it holds when there are only two potential evidence items (e.g., “emerald is green” vs. “emerald is blue”). The general case is still open.

Epistemic conservatism is a methodological norm that has been prominent in philosophy at least since Quine’s notion of “minimal mutilation” of our beliefs [1951]. One version of epistemic conservatism, as we saw above, holds that inquiry should seek stable belief. Another formulation, closer to Quine’s, is the general precept that belief changes in light of new evidence should be minimal. Fairly recent work in philosophical logic has proposed a number of criteria for minimal belief change known as the AGM axioms [Gärdenfors 1988]. Learning theorists have shown that whenever there is a reliable method for investigating an empirical question, there is one that proceeds via minimal changes (as defined by the AGM postulates). The properties of reliable inquiry with minimal belief changes are investigated in [Martin and Osherson 1998, Kelly 1999, Baltag and Smets 2011, Baltag et al. 2015].

Much of computational learning theory focuses on inquirers with bounded rationality, that is, agents with cognitive limitations such as a finite memory or bounded computational capacities. Many categorical norms that do not interfere with empirical success for logically omniscient agents nonetheless restrict what cognitively bounded agents can reliably investigate. For example, consider the norm of consistency: Believe that a hypothesis is false as soon as the evidence is logically inconsistent with it. The consistency principle is part of both Bayesian confirmation theory and AGM belief revision. Kelly and Schulte [1995] show that consistency prevents even agents with cognitive powers far beyond those of any computer from reliably assessing certain hypotheses. The moral is that if a theory is sufficiently complex, agents who are not logically omniscient may be unable to determine immediately whether a given piece of evidence is consistent with the theory, and need to collect more data to detect the inconsistency. But the consistency principle—and a fortiori, Bayesian updating and AGM belief revision— do not acknowledge the usefulness of “wait and see more” as a scientific strategy.

More reflection on these and other philosophical issues in means-ends epistemology can be found in sources such as Huber [2018], [Glymour 1991], [Kelly 1996, Chs. 2,3], [Glymour and Kelly 1992], [Kelly et al. 1997], [Glymour 1994], [Bub 1994]. Of particular interest in the philosophy of science may be learning-theoretic models that accommodate historicist and relativist conceptions of inquiry, chiefly by expanding the notion of an inductive method so that methods may actively select paradigms for inquiry; for more details on this topic, see [Kelly 2000, Kelly 1996, Ch.13]. Book-length introductions to the mathematics of learning theory are [Kelly 1996, Martin and Osherson 1998, Jain et al. 1999]. “Induction, Algorithmic Learning Theory and Philosophy” is a collection of writings on learning theory [Friend et al. 2007]. Contributions include introductory papers (Harizanov, Schulte), mathematical advances (Martin, Sharma, Stephan, Kalantari), philosophical reflections on the strengths and implications of learning theory (Glymour, Larvor, Friend), applications of the theory to philosophical problems (Kelly), and a discussion of learning-theoretic thinking in the history of philosophy (Goethe).

Supplementary Document: Basic Formal Definitions

Bibliography

  • Abramsky, S., 1987. Domain Theory and the Logic of Observable Properties, Ph.D. Dissertation, University of London.
  • Apsitis, K., 1994. “Derived sets and inductive inference”, in Proceedings of the 5th International Workshop on Algorithmic Learning Theory, S. Arikawa, K.P. Jantke (eds.), Berlin, Heidelberg: Springer, pp. 26–39.
  • Baltag, A. and Smets, S., 2011. “Keep changing your beliefs, aiming for the truth”, Erkenntnis, 75(2): 255–270.
  • Baltag, A., Gierasimczuk, N., Smets, S., 2015. “On the Solvability of Inductive Problems: A Study in Epistemic Topology”, Proceedings of the 15th Conference on Theoretical Aspects of Rationality and Knowledge (TARK 2015), pp. 65–74. Electronic Proceedings in Theoretical Computer Science available online.
  • Bub, J., 1994. “Testing Models of Cognition Through the Analysis of Brain-Damaged Performance”, British Journal for the Philosophy of Science, 45: 837–55.
  • Carlucci, L., Case, J., Jain, S. and Stephan, F., 2005. “Non U-shaped vacillatory and team learning”, in International Conference on Algorithmic Learning Theory, Berlin, Heidelberg: Springer, pp. 241–255.
  • Chart, D., 2000. “Schulte and Goodman’s Riddle”, British Journal for the Philosophy of Science, 51: 837–55.
  • de Brecht, M. and Yamamoto, A., 2008. “Topological properties of concept spaces”, in International Conference on Algorithmic Learning Theory, Berlin, Heidelberg: Springer, pp. 374–388.
  • Domingos, P., 1999. “The role of Occam’s razor in knowledge discovery”, Data Mining and Knowledge Discovery, 3(4): 409–425.
  • Earman, J., 1992. Bayes or Bust?, Cambridge, Mass.: MIT Press.
  • Feynman, R., 1965. The Character of Physical Law, Cambridge, Mass.: MIT Press; 19th edition, 1990.
  • Friend, M., N. Goethe and V. Harizanov (eds.), 2007. Induction, Algorithmic Learning Theory, and Philosophy, Dordrecht: Springer.
  • Ford, K., 1963. The World of Elementary Particles, New York: Blaisdell Publishing.
  • Gärdenfors, P., 1988. Knowledge In Flux: modeling the dynamics of epistemic states, Cambridge, Mass.: MIT Press.
  • Genin, K., 2018. “The Topology of Statistical Inquiry”, Ph.D. Dissertation, Department of Philosophy, Carnegie Mellon University, Genin 2018 available online.
  • Genin, K. and Kelly, K., 2015. “Theory Choice, Theory Change, and Inductive Truth-Conduciveness”, Proceedings of the 15th Conference on Theoretical Aspects of Rationality and Knowledge (TARK 2015). Publisher: Electronic Proceedings in Theoretical Computer Science. Extended Abstract, Genin & Kelly 2015 available online.
  • –––, 2017. “The Topology of Statistical Verifiability”, Proceedings of the 17th Conference on Theoretical Aspects of Rationality and Knowledge (TARK 2017). Electronic Proceedings in Theoretical Computer Science, preprint available online.
  • –––, 2019. “Theory Choice, Theory Change, and Inductive Truth-Conduciveness”, Studia Logica, 107: 949–989.
  • Glymour, C., 1991. “The Hierarchies of Knowledge and the Mathematics of Discovery”, Minds and Machines, 1: 75–95.
  • –––, 1994. “On the Methods of Cognitive Neuropsychology”, British Journal for the Philosophy of Science, 45: 815–35.
  • Glymour, C. and Kelly, K., 1992. “Thoroughly Modern Meno”, in Inference, Explanation and Other Frustrations, John Earman (ed.), Berkeley: University of California Press.
  • Gold, E., 1967. “Language Identification in the Limit”, Information and Control, 10: 447–474.
  • Goodman, N., 1983. Fact, Fiction and Forecast, Cambridge, MA: Harvard University Press.
  • Harrell, M., 2000. Chaos and Reliable Knowledge, Ph.D. Dissertation, University of California at San Diego.
  • Harman, G. and Kulkarni, S., 2007. Reliable Reasoning: Induction and Statistical Learning Theory, Cambridge, MA: The MIT Press.
  • Huber, F., 2018. A Logical Introduction to Probability and Induction, Oxford: Oxford University Press.
  • Jain, S., et al., 1999. Systems That Learn, 2nd edition, Cambridge, MA: MIT Press.
  • James, W., 1982. “The Will To Believe”, in Pragmatism, H.S. Thayer (ed.), Indianapolis: Hackett.
  • Juhl, C., 1997. “Objectively Reliable Subjective Probabilities”, Synthese, 109: 293–309.
  • Kelly, K., 1996. The Logic of Reliable Inquiry, Oxford: Oxford University Press.
  • –––, 1999. “Iterated Belief Revision, Reliability, and Inductive Amnesia”, Erkenntnis, 50: 11–58.
  • –––, 2000. “The Logic of Success”, British Journal for the Philosophy of Science, 51(4): 639–660.
  • –––, 2007a. “How Simplicity Helps You Find the Truth Without Pointing at it”, in Induction, Algorithmic Learning Theory, and Philosophy, M. Friend, N. Goethe and V. Harizanov (eds.), Dordrecht: Springer, pp. 111–144.
  • –––, 2008. “Ockham’s Razor, Truth, and Information”, in Handbook of the Philosophy of Information, J. van Benthem and P. Adriaans (eds.), Dordrecht: Elsevier.
  • –––, 2010. “Simplicity, Truth, and Probability”, in Handbook for the Philosophy of Statistics, Prasanta S. Bandyopadhyay and Malcolm Forster (eds.), Dordrecht: Elsevier.
  • Kelly, K., and Schulte, O., 1995. “The Computable Testability of Theories Making Uncomputable Predictions”, Erkenntnis, 43: 29–66.
  • Kelly, K., Schulte, O. and Juhl, C., 1997. “Learning Theory and the Philosophy of Science”, Philosophy of Science, 64: 245–67.
  • Kuhn, T., 1970. The Structure of Scientific Revolutions. Chicago: University of Chicago Press.
  • Luo, W. and Schulte, O., 2006. “Mind Change Efficient Learning”, Information and Computation, 204: 989–1011.
  • Martin, E. and Osherson, D., 1998. Elements of Scientific Inquiry, Cambridge, MA: MIT Press.
  • Ne’eman, Y. and Kirsh, Y., 1983. The Particle Hunters, Cambridge: Cambridge University Press.
  • Omnes, R., 1971. Introduction to Particle Physics, London, New York: Wiley Interscience.
  • Popper, Karl, 1962. Conjectures and Refutations: The Growth of Scientific Knowledge, New York: Basic Books.
  • Putnam, H., 1963. “Degree of Confirmation and Inductive Logic”, in The Philosophy of Rudolf Carnap, P.A. Schilpp (ed.), La Salle, Ill: Open Court.
  • Putnam, H., 1965. “Trial and Error Predicates and the Solution to a Problem of Mostowski”, Journal of Symbolic Logic, 30(1): 49–57.
  • Quine, W., 1951. “Two Dogmas of Empiricism”, Philosophical Review, 60: 20–43.
  • Salmon, W., 1991. “Hans Reichenbach’s Vindication of Induction”, Erkenntnis, 35: 99–122.
  • Schulte, O., 1999. “Means-Ends Epistemology”, The British Journal for the Philosophy of Science, 50: 1–31.
  • –––, 2008. “The Co-Discovery of Conservation Laws and Particle Families”, Studies in History and Philosophy of Modern Physics, 39(2): 288–314.
  • –––, 2009. “Simultaneous Discovery of Conservation Laws and Hidden Particles With Smith Matrix Decomposition”, in Proceedings of the Twenty-First International Joint Conference on Artificial Intelligence (IJCAI-09), Palo Alto: AAAI Press, pp. 1481–1487.
  • Schulte, O., Luo, W., and Greiner, R., 2007. “Mind Change Optimal Learning of Bayes Net Structure”, in Proceedings of the 20th Annual Conference on Learning Theory (COLT’07, San Diego, CA, June 12–15), N. Bshouti and C. Gentile (eds.), Berlin, Heidelberg: Springer, pp. 187–202.
  • Schulte, O., and Cory Juhl, 1996. “Topology as Epistemology”, The Monist, 79(1): 141–147.
  • Sklar, L., 1975. “Methodological Conservatism”, Philosophical Review, 84: 374–400.
  • Sober, E., 2015. Ockham’s Razors, Cambridge: Cambridge University Press.
  • Spirtes, P., Glymour, C., Scheines, R., 2000. Causation, Prediction, and Search, Cambridge, MA: MIT Press.
  • Steel, D., 2009. “Testability and Ockham’s Razor: How Formal and Statistical Learning Theory Converge in the New Riddle of Induction,” Journal of Philosophical Logic, 38: 471–489.
  • –––, 2010. “What if the principle of induction is normative? Formal learning theory and Hume’s problem”, International Studies in the Philosophy of Science, 24(2): 171–185.
  • Valiant, L. G., 1984. “A theory of the learnable”, Proceedings of the Sixteenth Annual ACM Symposium on Theory of Computing (STOC 84), New York: ACM Press, pp. 436–445.
  • Vickers, S., 1996. Topology Via Logic, Cambridge: Cambridge University Press.

Other Internet Resources

Copyright © 2022 by
Oliver Schulte <oschulte@sfu.ca>
