
The Problem of Induction

First published Wed Nov 15, 2006; substantive revision Tue Mar 2, 2010

Until about the middle of the previous century induction was treated as a quite specific method of inference: inference of a universal affirmative proposition (All swans are white) from its instances (a is a white swan, b is a white swan, etc.). The method also had a probabilistic form, in which the conclusion stated a probabilistic connection between the properties in question. It is no longer possible to think of induction in such a restricted way; much synthetic or contingent inference is now taken to be inductive, and some authorities go so far as to count all contingent inference as inductive. One powerful force driving this lexical shift was certainly the erosion of the intimate classical relation between logical truth and logical form; propositions had classically been categorized as universal or particular, negative or affirmative, and modern logic renders those distinctions unimportant. (The paradox of the ravens makes this evident.) The distinction between logic and mathematics also waned in the twentieth century, and this, along with the simple axiomatization of probability by Kolmogorov in 1933 (Kolmogorov FTP), blended probabilistic and inductive methods, blurring in the process structural differences among inferences.

As induction expanded and became more amorphous, the problem of induction was transformed too. The classical problem, if apparently insoluble, was simply stated, but the contemporary problem of induction has no such crisp formulation. The approach taken here is to provide brief expositions of several distinctive accounts of induction. This survey is not comprehensive; there are other ways to look at the problem, but the untutored reader may gain at least a map of the terrain.


1. What is the Problem?

The Oxford English Dictionary defines “induction”, in the sense relevant here, as follows:

7. Logic a. The process of inferring a general law or principle from the observation of particular instances (opposed to DEDUCTION, q.v.).

That induction is opposed to deduction is not quite right, and the rest of the definition is outdated and too narrow: much of what contemporary epistemology, logic, and the philosophy of science count as induction infers neither from observation nor from particulars and does not lead to general laws or principles. This is not to denigrate the leading authority on English vocabulary—until the middle of the previous century induction was understood to be what we now know as enumerative induction or universal inference; inference from particular instances:

a1, a2, …, an are all Fs that are also G,

to a general law or principle

All Fs are G.

A weaker form of enumerative induction, singular predictive inference, leads not to a generalization but to a singular prediction:

1. a1, a2, …, an are all Fs that are also G.

2. an+1 is also F.

Therefore,

3. an+1 is also G.

Singular predictive inference also has a more general probabilistic form:

1. The proportion p of observed Fs have also been Gs.

2. a, not yet observed, is an F.

Therefore,

3. It is probable that a is a G.
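The probabilistic form lends itself to a simple computational illustration. What follows is a minimal sketch, not part of the traditional formulation; the counts and function names are invented for illustration. It simply takes the observed proportion of Fs that were Gs as the probability attached to the prediction about a new F.

    # A minimal sketch of probabilistic singular predictive inference.
    # The counts and names are invented for illustration only.
    def predictive_probability(observed_fs_that_were_g, observed_fs_total):
        """Return the observed proportion p of Fs that were also Gs, used here
        as the probability that a further, not yet observed F is a G."""
        if observed_fs_total == 0:
            raise ValueError("no observed Fs: the inference has no premises")
        return observed_fs_that_were_g / observed_fs_total

    # Example: 87 of 100 observed Fs were Gs, so the conclusion is that it is
    # probable (p = 0.87) that the next F is a G.
    print(predictive_probability(87, 100))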

The problem of induction was, until recently, taken to be to justify these forms of inference; to show that the truth of the premises supported, if it did not entail, the truth of the conclusion. The evolution and generalization of this question—the traditional problem has become a special case—is discussed in some detail below. Section 3, in particular, points out some essential difficulties in the traditional view of enumerative induction.

1.1 Mathematical induction

As concerns the parenthetical opposition between induction and deduction, the classical way to characterize valid deductive inference is as follows: a set of premises deductively entails a conclusion if no way of interpreting the non-logical signs, holding constant the meanings of the logical signs, can make the premises true and the conclusion false. For present purposes the logical signs always include the truth-functional connectives (and, not, etc.), the quantifiers (all, some), and the sign of identity (=). Enumerative induction and singular predictive inference are clearly not valid deductive methods when deduction is understood in this way. (A few revealing counterexamples are to be found in section 3.2 below.)

Regarded in this way, mathematical induction is a deductive method, and is in this opposed to induction in the sense at issue here. Mathematical induction is the following inferential rule (F is any numerical property):

Premises:
  • 0 has the property F.
  • For every number n, if n has the property F then n+1 has the property F.

Conclusion:

  • Every number has the property F.

When the logical signs are expanded to include the basic vocabulary of arithmetic (‘is a number’, +, ×, ′, 0), mathematical induction is seen to be a deductively valid method: any interpretation in which these signs have their standard arithmetical meaning is one in which the truth of the premises assures the truth of the conclusion. Mathematical induction, we might say, is deductively valid in arithmetic, if not in pure logic.
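The deductive character of the rule can be made concrete in a proof assistant, where mathematical induction is applied as an inference step and the conclusion follows without any appeal to experience. The following is a minimal sketch in Lean 4 (the theorem name is invented; the lemma Nat.add_succ is assumed from the core library):

    -- Mathematical induction used deductively: from the base case and the
    -- inductive step the universal conclusion follows for every natural number.
    theorem zero_add_nat (n : Nat) : 0 + n = n := by
      induction n with
      | zero => rfl                         -- base case: 0 + 0 = 0 by computation
      | succ k ih => rw [Nat.add_succ, ih]  -- step: from 0 + k = k to 0 + (k + 1) = k + 1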

Mathematical induction should thus be distinguished from induction in the sense of present concern. Mathematical induction will concern us no further beyond a brief terminological remark: the kinship with non-mathematical induction and its problems is fostered by the particular-to-general clause in the common definition. (See section 5.4 of the entry on Frege's logic, theorem, and foundations for arithmetic, for a more complete discussion and justification of mathematical induction.)

1.2 The contemporary notion of induction

A few simple counterexamples to the OED definition may suggest the increased breadth of the contemporary notion:

  1. There are (good) inductions with general premises and particular conclusions:
    All observed emeralds have been green.
    Therefore, the next emerald to be observed will be green.
  2. There are valid deductions with particular premises and general conclusions:
    New York is east of the Mississippi.
    Delaware is east of the Mississippi.
    Therefore, everything that is either New York or Delaware is east of the Mississippi.

Further, on at least one serious view, due in differing variations to Mill and Carnap, induction has not to do with generality at all; its primary form is the singular predictive inference—the second form of enumerative induction mentioned above—which leads from particular premises to a particular conclusion. The inference to generality is a dispensable middle step.

Although inductive inference is not easily characterized, we do have a clear mark of induction. Inductive inferences are contingent, deductive inferences are necessary. (But see the entry Formal Learning Theory where this distinction is elaborated.) Deductive inference can never support contingent judgments such as meteorological forecasts, nor can deduction alone explain the breakdown of one's car, discover the genotype of a new virus, or reconstruct fourteenth century trade routes. Inductive inference can do these things more or less successfully because, in Peirce's phrase, inductions are ampliative. Induction can amplify and generalize our experience, broaden and deepen our empirical knowledge. Deduction on the other hand is explicative. Deduction orders and rearranges our knowledge without adding to its content.

Of course, the contingent power of induction brings with it the risk of error. Even the best inductive methods applied to all available evidence may get it wrong; good inductions may lead from true premises to false conclusions. (A competent but erroneous diagnosis of a rare disease, a sound but false forecast of summer sunshine in the desert.) An appreciation of this principle is a signal feature of the shift from the traditional to the contemporary problem of induction. (See sections 3.2 and 3.3 below.)

How to tell good inductions from bad ones? That question is a simple formulation of the problem of induction. In its general form it clearly has no substantive answer, but its instances can yield modest and useful questions. Some of these questions, and proposed answers to them, are surveyed in what follows.

Some authorities, Carnap in the opening paragraph of (Carnap 1952) is an example, take inductive inference to include all non-deductive inference. That may be a bit too inclusive; perception and memory are clearly ampliative but their exercise seems not to be congruent with what we know of induction, and the present article is not concerned with them. (See the entries on epistemological problems of perception and epistemological problems of memory.)

Testimony is another matter. Although testimony is not a form of induction, induction would be all but paralyzed were it not nourished by testimony. Scientific inductions depend upon data transmitted and supported by testimony and even our everyday inductive inferences typically rest upon premises that come to us indirectly. (See the remarks on testimony in section 8.4.3, and the entry on epistemological problems of testimony.)

1.3 Can induction be justified?

There is a simple argument, due in its first form to Hume (Hume THN, I.III.VI) that induction (not Hume's word) cannot be justified. The argument is a dilemma: Since induction is a contingent method—even good inductions may lead from truths to falsehoods—there can be no deductive justification for induction. Any inductive justification of induction would, on the other hand, be circular. Hume himself takes the edge off this argument later in the Treatise. “In every judgment,” he writes, “…we ought always to correct the first judgment, deriv'd from the nature of the object, by another judgment, deriv'd from the nature of the understanding” (Hume THN, 181f.).

A more general question is this: Why trust induction more than other methods of fixing belief? Why not consult sacred writings, the pronouncements of authorities or “the wisdom of crowds” to explain the movements of the planets, the weather, automotive breakdowns or the evolution of species? We return to these and related questions in section 8.3.

2. Hume, induction and justification

The source for the problem of induction as we know it is Hume's brief argument in Book I, Part III, section VI of the Treatise (Hume THN). The great historical importance of this argument, not to speak of its intrinsic power, recommends that reflection on the problem begin with a rehearsal of it. The brief summary in sections 10 and 11 of the entry on Hume provides what is needed, and those who are not familiar with the argument are well advised to read them in conjunction with the present section. It will also be helpful in understanding the deceptively simple argument to have some idea of Hume's project in the Treatise. For this, section 4 of that entry is most useful. Indeed, the first twelve sections of the article serve as a brief and comprehensive introduction to Hume's theory of knowledge. Reference to this article permits an abbreviated account here of his classic argument.

First, two notes about vocabulary. The term ‘induction’ does not appear in Hume's argument, nor anywhere in the Treatise or the first Inquiry, for that matter. Hume's concern is with inferences concerning causal connections, which, on his account, are the only connections “which can lead us beyond the immediate impressions of our memory and senses” (Hume THN, 89). But the difference between such inferences and what we know today as induction is largely a matter of terminology. Secondly, Hume divides all reasoning into demonstrative, by which he means deductive, and probabilistic, by which he means the generalization of causal reasoning. In what follows we paraphrase and interpolate freely so as to ease the application of the argument in contemporary contexts.

It should also be remarked that Hume's argument applies just to enumerative induction, and primarily to singular predictive inference, but, again, its generalization to other forms of inductive reasoning is straightforward.

The argument should be seen against the background of Hume's project as he announces it in the introduction to the Treatise: This project is the development of the empirical science of human nature. The epistemological sector of this science involves describing the operations of the mind, the interactions of impressions and ideas and the function of the liveliness that constitutes belief. But this cannot be a merely descriptive endeavor; accurate description of these operations entails also a considerable normative component, for, as Hume puts it, “[o]ur reason [to be taken here quite generally, to include the imagination] must be consider'd as a kind of cause, of which truth is the natural effect; but such-a-one as by the irruption of other causes, and by the inconstancy of our mental powers, may frequently be prevented” (Hume THN, 180). The account must thus not merely describe what goes on in the mind, it must also do this in such a way as to show that and how these mental activities lead naturally, if with frequent exceptions, to true belief. (See Loeb 2006 for further discussion of these questions.)

Now as concerns the argument, its conclusion is that in induction (causal inference) experience does not produce the idea of an effect from an impression of its cause by means of the understanding or reason, but by the imagination, by “a certain association and relation of perceptions.” The center of the argument is a dilemma: If inductive conclusions were produced by the understanding, inductive reasoning would be based upon the premise that nature is uniform; “that instances of which we have had no experience, must resemble those of which we have had experience, and that the course of nature continues always uniformly the same.” (Hume THN, 89) And were this premise to be established by reasoning, that reasoning would be either deductive or probabilistic (i.e. causal). The principle can't be proved deductively, for whatever can be proved deductively is a necessary truth, and the principle is not necessary; its antecedent is consistent with the denial of its consequent. Nor can the principle be proved by causal reasoning, for it is presupposed by all such reasoning and any such proof would be a petitio principii.

The normative component of Hume's project is striking here: That the principle of uniformity of nature cannot be proved deductively or inductively shows that it is not the principle that drives our causal reasoning only if our causal reasoning is sound and leads to true conclusions as a “natural effect” of belief in true premises. This is what licenses the capsule description of the argument as showing that induction cannot be justified or licensed either deductively or inductively; not deductively because (non-trivial) inductions do not express logically necessary connections, not inductively because that would be circular. If, however, causal reasoning were fallacious, the principle of the uniformity of nature might well be among its principles.

The negative argument is an essential first step in Hume's general account of induction. It rules out accounts of induction that view it as the work of reason. Hume's positive account begins from a constructive dilemma: Inductive inference must be the work either of reason or of imagination. Since the negative argument shows that it cannot be a species of reasoning, it must be imaginative.

Hume's positive account of causal inference can be simply described: It amounts to embedding the singular form of enumerative induction in the nature of human, and at least some bestial, thought. The several definitions offered in (Hume EHU, 60) make this explicit:

[W]e may define a cause to be an object, followed by another, and where all objects similar to the first are followed by objects similar to the second. Or, in other words, where, if the first object had not been, the second never had existed.

Another definition defines a cause to be:

an object followed by another, and whose appearance always conveys the thought to that other.

If we have observed many Fs to be followed by Gs, and no contrary instances, then observing a new F will lead us to anticipate that it will also be a G. That is causal inference.

It is clear, says Hume, that we do make inductive, or, in his terms, causal, inferences; that having observed many Fs to be Gs, observation of a new instance of an F leads us to believe that the newly observed F is also a G. It is equally clear that the epistemic force of this inference, what Hume calls the necessary connection between the premises and the conclusion, does not reside in the premises alone:

All observed Fs have also been Gs,

and

a is an F,

do not imply

a is a G.

It is false that “instances of which we have had no experience must resemble those of which we have had experience” (Hume THN, 89).

Hume's view is that the experience of constant conjunction fosters a “habit of the mind” that leads us to anticipate the conclusion on the occasion of a new instance of the second premise. The force of induction, the force that drives the inference, is thus not an objective feature of the world, but a subjective power; the mind's capacity to form inductive habits. The objectivity of causality, the objective support of inductive inference, is thus an illusion, an instance of what Hume calls the mind's “great propensity to spread itself on external objects” (Hume THN, 167).

It is important to distinguish in Hume's account causal inference from causal belief: Causal inference does not require that the agent have the concept of cause; animals may make causal inferences (Hume THN, 176–179; Hume EHU, 104–108) which occur when past experience of constant conjunction leads to the anticipation of the subsequent conjunct upon experience of the precedent. Causal beliefs, on the other hand, beliefs of the form

A causes B,

may be formed when one reflects upon causal inferences as, presumably, animals cannot (Hume THN, 78).

Hume's account raises the problem of induction in an acute form: One would like to say that good and reliable inductions are those that follow the lines of causal necessity; that when

All observed Fs have also been Gs,

is the manifestation in experience of a causal connection between F and G, then the inference

All observed Fs have also been Gs,
a is an F,
Therefore, a, not yet observed, is also a G,

is a good induction. But if causality is not an objective feature of the world this is not an option. The Humean problem of induction is then the problem of distinguishing good from bad inductive habits in the absence of any corresponding objective distinction.

Two sides or facets of the problem of induction should be distinguished: The epistemological problem is to find a method for distinguishing good or reliable inductive habits from bad or unreliable habits. The second and deeper problem is metaphysical. This is the problem of saying what the difference is between reliable and unreliable inductions. This is the problem that Whitehead called “the despair of philosophy” (Whitehead 1948, 35). The distinction can be illustrated in the parallel case of arithmetic. The by now classic incompleteness results of the last century show that the epistemological problem for first-order arithmetic is insoluble; that there can be no method, in a quite clear sense of that term, for distinguishing the truths from the falsehoods of first-order arithmetic. But the metaphysical problem for arithmetic has a clear and correct solution: the truths of first-order arithmetic are precisely the sentences that are true in all arithmetic models. Our understanding of the distinction between arithmetic truths and falsehoods is just as clear as our understanding of the simple recursive definition of truth in arithmetic, though any method for applying the distinction must remain forever out of our reach.

Now as concerns inductive inference, it is hardly surprising to be told that the epistemological problem is insoluble; that there can be no formula or recipe, however complex, for ruling out unreliable inductions. But Hume's arguments, if they are correct, have apparently a much more radical consequence than this: They seem to show that the metaphysical problem for induction is insoluble; that there is no objective difference between reliable and unreliable inductions. This is counterintuitive. Good inductions are supported by causal connections and we think of causality as an objective matter: The laws of nature express objective causal connections. Ramsey writes in his Humean account of the matter:

Causal laws form the system with which the speaker meets the future; they are not, therefore, subjective in the sense that if you and I enunciate different ones we are each saying something about ourselves which pass by one another like “I went to Grantchester”, “I didn't” (Ramsey 1931, 241).

A satisfactory resolution of the problem of induction would account for this objectivity in the distinction between good and bad inductions.

It might seem that Hume's argument succeeds only because he has made the criteria for a solution to the problem too strict. Enumerative induction does not realistically lead from premises

All observed Fs have also been Gs
a is an F,

to the simple assertion

Therefore, a, not yet observed, is also a G.

Induction is contingent inference and as such can yield a conclusion only with a certain probability. The appropriate conclusion is

It is therefore probable that a, not yet observed, is also a G.

Hume's response to this (Hume THN, 89) is to insist that probabilistic connections, no less than simple causal connections, depend upon habits of the mind and are not to be found in our experience of the world. Weakening the inferential force between premises and conclusion may divide and complicate inductive habits; it does not eliminate them. The laws of probability alone have no more empirical content than does deductive logic. If I infer from observing clouds followed by rain that today's clouds will probably be followed by rain this can only be in virtue of an imperfect habit of associating rain with clouds. This account is treated in more detail below.

Hume is also the progenitor of one sort of theory of inductive inference which, if it does not pretend to solve the metaphysical problem, does offer an at least partial account of reliability. We consider this tradition below in section 8.1.

2.1 Induction and its justification

Hume's argument is often credited with raising the problem of induction in its modern form. For Hume himself the conclusion of the argument is not so much a problem as a principle of his account of induction: Inductive inference is not and could not be reasoning, either deductive or probabilistic, from premises to conclusion, so we must look elsewhere to understand it. Hume's positive account, discussed in sections 5.3 and 8.3 below, does much to alleviate the epistemological problem—how to distinguish good inductions from bad ones—without treating the metaphysical problem. His account is based on the principle that inductive inference is the work of association which forms a “habit of the mind” to anticipate the consequence, or effect, upon witnessing the premise, or cause. He provides illuminating examples of such inferential habits in sections I.III.XI and I.III.XII of (Hume THN). The latter accounts for frequency-to-probability inferences in a comprehensive way. It shows that and how inductive inference is “a kind of cause, of which truth is the natural effect.”

Although Hume is the progenitor of modern work on induction, induction presents a problem, indeed a multitude of problems, quite in its own right. The by now traditional problem is the matter of justification: How is induction to be justified? There are in fact several questions here, corresponding to different modes of justification. One very simple mode is to take Hume's dilemma as a challenge: to justify (enumerative) induction one should show that it leads to true or probable conclusions from true premises. It is safe to say that in the absence of further assumptions this problem is and should be insoluble. The realization of this dead end and the proliferation of other forms of induction have led to more specialized projects involving various strengthened premises and assumptions. The several approaches treated below exemplify this.

Hume's dilemma also sponsors a much more sweeping challenge: Neither deduction nor induction can give reason to trust induction, so what reason is there to trust it at all? Why, in particular, trust induction rather than other methods of fixing belief? Why not consult sacred writings, the pronouncement of authorities or “the wisdom of crowds” to explain and predict the movements of the planets, the weather, automotive breakdowns or the evolution of species? We return to these and related questions in section 8.4.

3. Verification, Confirmation, and the Paradoxes of Induction

3.1 Verifiability and confirmation

The verifiability criterion of meaning was essential to logical positivism (see the section on verificationism in the entry on the Vienna Circle). In its first and simplest form the criterion said just that the meaning of a synthetic statement is the method of its empirical verification. (Analytic statements were held to be logically verifiable.) The point of the principle was to class metaphysical statements as meaningless, since such statements (Kant's claim that noumenal matters are beyond experience was a favored example) could obviously not be empirically verified. This initial formulation of the criterion was soon seen to be too strong; it counted as meaningless not only metaphysical statements but also statements that are clearly empirically meaningful, such as that all copper conducts electricity and, indeed, any universally quantified statement of infinite scope, as well as statements that were at the time beyond the reach of experience for technical, and not conceptual, reasons, such as that there are mountains on the back side of the moon. These difficulties led to modifications of the criterion: the latter difficulty was met by requiring only that verification be possible in principle, if not in fact; the former by softening verification to empirical confirmation. So, that all copper conducts electricity can be confirmed, if not verified, by its observed instances. Observation of successive instances of copper that conduct electricity in the absence of counterinstances supports or confirms that all copper conducts electricity, and the meaning of “all copper conducts electricity” could thus be understood as the experimental method of this confirmation.

Empirical confirmation is inductive, and empirical confirmation by instances is a sort of enumerative induction. The problem of induction thus gains weight, at least in the context of modern empiricism, for induction now founds empirical meaning: to show that a statement is empirically meaningful we describe a good induction which, were the premises true, would confirm it. “There are mountains on the other side of the moon” is meaningful (in 1945) because space flight is possible in principle and the inference from

Space travelers observed mountains on the other side of the moon,

to

There are mountains on the other side of the moon,

is a good induction. “Copper conducts electricity” is meaningful because the inference from

Many observed instances of copper conduct and none fail to conduct,

to

All copper conducts,

is a good induction.

3.2 Some inductive paradoxes

That enumerative induction is a much subtler and more complex process than one might think is made apparent by the paradoxes of induction. The paradox of the ravens is a good example. By enumerative induction:

a is a raven and is black,

confirms (to some small extent)

All ravens are black.

That is just a straightforward application of instance confirmation. But the same rule allows that

a is non-black and is a non-raven,

confirms (to some small extent)

All non-black things are non-ravens.

The latter is logically equivalent to “all ravens are black”, and hence “all ravens are black” is confirmed by the observation of a white shoe (a non-black, non-raven). But this is a bad induction, and this case of enumerative induction looks to be unsound.

The paradox resides in the conflict of this counterintuitive result with our strong intuitive attachment to enumerative induction, both in everyday life and in the methodology of science. This conflict looks to require that we must either reject enumerative induction or agree that the observation of a white shoe confirms “all ravens are black”.

The (by now classic) resolution of this dilemma is due to C.G. Hempel (Hempel 1945) who credits discussion with Nelson Goodman. Assume first that we ignore all the background knowledge we bring to the question, such as that there are very many things that are either ravens or are not black, and that we look strictly at the truth-conditions of the premise (this is a white shoe) and the supported hypothesis (all ravens are black). The hypothesis says (is equivalent to)

Everything is either a black raven or is not a raven.

This hypothesis divides the world into three exclusive and exhaustive classes of things: non-black ravens, black ravens, and things that are not ravens. Any member of the first class falsifies the hypothesis. Each member of the other two classes confirms it. A white shoe is a member of the third class and is thus a confirming instance.

If this seems implausible it is because we in fact do not, as assumed, ignore the background knowledge that we bring to the question. We know before considering the inference that there are some black ravens and that there are many more non-ravens, many of which are not black. Observing a white shoe thus tells us nothing about the colors of ravens that we don't already know, and since induction is ampliative, good inductions should increase our knowledge. If we did not know that many non-ravens are not black, the observation of a white shoe would increase our knowledge.

On the other hand, we don't know whether any of the unobserved ravens are not black, i.e., whether the first and falsifying class of things has any members. Observing a raven that is black tells us that this object at least is not a falsifying instance of the hypothesis, and this we did not know before the observation.

As Goodman puts it, the paradoxical inference depends upon “tacit and illicit evidence” not stated in its formulation:

Taken by itself, the statement that the given object is neither black nor a raven confirms the hypothesis that everything that is not a raven is not black as well as the hypothesis that everything that is not black is not a raven. We tend to ignore the former hypothesis because we know it to be false from abundant other evidence — from all the familiar things that are not ravens but are black. (Goodman 1955, 72)

The important lesson of the paradox of the ravens and its resolution is that inductive inference, because it is ampliative, is sensitive to background information and context. What looks to be a good induction when considered in isolation turns out not to be so when the context, including background knowledge, is taken into account. The inductive inference from

a is a white shoe,

to

All ravens are black,

is not so much unsound as it is uninteresting and uninformative.

More recent discussion of the paradox continues and improves on the Hempel — Goodman account by making explicit, and thus licit, the suppressed evidence. (See, for example, Maher 1999 for a proposal of this sort in a Carnapian framework.) Further development, along vaguely Bayesian lines, generalizes the earlier approach by defining comparative (A confirms H better than does B) and quantitative (A confirms H to degree p) concepts of confirmation capable of differentiating support for the two hypotheses in question. (Fitelson and Hawthorne 2010) is an encyclopedic account of these efforts and includes also a comprehensive bibliography.

There are however other faulty inductions that look not to be accounted for by reference to background information and context:

Albert is in this room and is safe from freezing,

confirms

Everyone in this room is safe from freezing,

but

Albert is in this room and is a third son,

does not confirm

Everyone in this room is a third son,

and no amount of background information seems to explain this difference. The distinction is usually marked by saying that “Everyone in this room is safe from freezing” is a lawlike generalization, while “Everyone in this room is a third son” is an accidental generalization. But this distinction amounts to no more than that the first is confirmed by its instances while the second is not, so it cannot very well be advanced as an account of that difference. The problem is raised in a pointed way by Nelson Goodman's famous grue paradox (Goodman 1955, 73–75). (See (Norton, 2006), (Olson, 2006) and the entry on formal learning theory for recent commentary on the paradox.)

Grue Paradox:
Suppose that at time t we have observed many emeralds to be green. We thus have evidence statements
Emerald a is green,
Emerald b is green,
etc.

and these statements support the generalization:

All emeralds are green.

But now define the predicate “grue” to apply to all things observed before t just in case they are green, and to other things just in case they are blue. Then we have also the evidence statements

Emerald a is grue,
Emerald b is grue,
etc.

and these evidence statements support the hypothesis

All emeralds are grue.

Hence the same observations support incompatible hypotheses about emeralds to be observed in the future; that they will be green and that they will be blue.
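The point that one and the same body of evidence fits both generalizations can be checked mechanically. The following is a minimal sketch, with invented observation times and a cut-off t chosen for illustration: every emerald observed before t and found green satisfies both ‘green’ and ‘grue’, so instance confirmation, as initially characterized, counts the very same observations as support for each hypothesis.

    # A minimal sketch of the grue construction; the data are invented for illustration.
    T = 10  # the time t in the definition of "grue"

    def is_green(color):
        return color == "green"

    def is_grue(color, observed_at):
        # "grue": green if observed before t, blue otherwise
        return color == "green" if observed_at < T else color == "blue"

    # Emeralds observed before t, all found green.
    observations = [("green", 1), ("green", 4), ("green", 7)]

    print(all(is_green(color) for color, _ in observations))          # True: fits "all emeralds are green"
    print(all(is_grue(color, time) for color, time in observations))  # True: fits "all emeralds are grue"
    # Yet for an emerald first observed after t the two hypotheses predict
    # different colors: green and blue respectively.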

A few cautionary remarks about this frequently misunderstood paradox:

  1. No one thinks that the grue hypothesis is well supported. The paradox makes it clear that there is something wrong with instance confirmation and enumerative induction as initially characterized.
  2. Neither the grue evidence statements nor the grue hypothesis entails that any emeralds change color. This is a common confusion (see, for example, Armstrong 1983, 58; and Nix & Paris 2007, 36).
  3. The grue paradox cannot be resolved, as was the raven paradox, by looking to background knowledge (as would be the case if it entailed color changes). Of course we know that it is extremely unlikely that any emeralds are grue. That just restates the point of the paradox and does nothing to resolve it.
  4. That the definition of “grue” includes a time parameter is sometimes advanced as a criticism of the definition. But, as Goodman remarks, were we to take “grue” and its obverse “bleen” (“blue up to t, green thereafter”) instead of “green” and “blue” as primitive terms, definitions of the latter would include time parameters (“green” =def “grue if observed before t and bleen if observed thereafter”). The question here is whether inductive inference should be relative to the language in which it is formulated. Deductive inference is relative in this way as is Carnapian inductive logic.

3.3 Confirmation and deductive logic

Induction helps us to localize our actual world among all the possible worlds. This is not to say that induction applies only in the actual world: The premises of a good induction confirm its conclusion whether those premises are true or false in the actual world. This leads to a few principles relating confirmation and deduction. If A and B are true in the same possible worlds, then B confirms whatever A confirms and whatever confirms B also confirms A:

Equivalence principle:
If A confirms B then any logical equivalent of A confirms any logical equivalent of B.

(We appealed to this principle in stating the paradox of the ravens above.) A second principle follows from the truth that if B logically implies C then every subset of the B worlds is also a subset of the C worlds:

Implicative principle:
If A confirms B, then A confirms every logical consequence of B.

But we do not have that whatever implies A confirms whatever A confirms:

That a presidential candidate wins the state of New York confirms that he will win the election.

That a candidate wins New York and loses California and Texas does not confirm that he will win the election, though “wins New York and loses California and Texas” logically implies “wins New York”.

This marks an important contrast between confirmation and logical implication, between induction and deduction. Logical implication is transitive: whatever implies a proposition implies all of its logical consequences, for implication corresponds to the transitive subset relation among sets of worlds. But when A implies B and B confirms C, the B worlds in which C is true may (as in the example) exclude the A worlds. Inductive reasoning is said to be non-monotonic, for in contrast to deduction, the addition of premises may annul what was a good induction (the inference from the premise P to the conclusion R may be inductively strong while the inference from the premises P, Q to the conclusion R may not be). (See the entry on non-monotonic logic, and section 7.1 below for a striking example.) For this reason induction and confirmation are subject to the principle of total evidence, which requires that all relevant evidence be taken into account in every induction. No such requirement is called for in deduction; adding premises to a valid deduction can never make it invalid.
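Non-monotonicity is easy to exhibit in a toy probability model of the election example. The following is a minimal sketch; the joint distribution is invented for illustration, and ‘confirms’ is read here simply as raising the conditional probability of the conclusion.

    # A toy joint distribution over truth values of P ("wins New York"),
    # Q ("loses California and Texas"), and R ("wins the election").
    # The weights are invented for illustration and sum to 1.
    distribution = {
        (True,  False, True ): 0.40,
        (True,  False, False): 0.10,
        (True,  True,  True ): 0.02,
        (True,  True,  False): 0.18,
        (False, False, True ): 0.05,
        (False, False, False): 0.15,
        (False, True,  True ): 0.00,
        (False, True,  False): 0.10,
    }

    def prob(event):
        """Probability of the outcomes satisfying the given predicate."""
        return sum(w for outcome, w in distribution.items() if event(outcome))

    def conditional(conclusion, premise):
        return prob(lambda o: conclusion(o) and premise(o)) / prob(premise)

    P = lambda o: o[0]
    P_and_Q = lambda o: o[0] and o[1]
    R = lambda o: o[2]

    print(conditional(R, P))        # 0.6: winning New York supports winning the election
    print(conditional(R, P_and_Q))  # 0.1: the added premise annuls the support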

Yet another contrast between induction and deduction is revealed by the lottery paradox. (See section 3.3 of the entry on conditionals.) If there are many lottery tickets sold, just one of which will win, each induction from these premises to the conclusion that a given ticket will not win is a good one. But the conjunction of all those conclusions is inconsistent with the premises, for some ticket must win. Thus good inductions from the same set of premises may lead to conclusions that are conjunctively inconsistent. This paradox is at least softened by some theories of conditionals (e.g., Adams 1975).
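A short worked example with an invented ticket count makes the arithmetic explicit: each conclusion of the form ‘this ticket will not win’ is highly probable, yet the conjunction of all such conclusions is inconsistent with the premise that exactly one ticket wins.

    # Lottery paradox arithmetic; the number of tickets is invented for illustration.
    n_tickets = 1000
    p_this_ticket_loses = (n_tickets - 1) / n_tickets  # 0.999 for each ticket taken singly
    p_every_ticket_loses = 0.0                         # impossible: exactly one ticket wins
    print(p_this_ticket_loses, p_every_ticket_loses)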

4. Induction, Causality, and Laws of Nature

What we know as the problem of enumerative induction Hume took to be the problem of causal knowledge, of identifying genuine causal regularities. Hume held that all ampliative knowledge was causal and from this point of view, as remarked above, the problem of induction is narrower than the problem of causal knowledge so long as we admit that some ampliative knowledge is not inductive. On the other hand, we now think of causal connection as being a particular kind of contingent connection and of inductive reasoning as having a wider application, including such non-causal forms as inferring the distribution of a trait in a population from its distribution in a sample from that population.

4.1 Causal inductions

Causal inductions are a significant subclass of inductions. They form a problem, or a constellation of problems, of induction in their own right. One of the classic twentieth century accounts of the problem of induction, that of Nelson Goodman (Goodman 1955), focuses on enumerative inductions that support causal laws. Goodman argued that three forms of the problem of enumerative induction turn out to be equivalent. These were: (1) Supporting subjunctive and contrary to fact conditionals; (2) Establishing criteria for confirmation that would not stumble on the grue paradox; and (3) Distinguishing lawlike hypotheses from accidental generalizations. (A sentence is lawlike if it is like a law of nature with the possible exception of not being true.) Put briefly, a counterfactual is true if some scientific law permits inference of its consequent from its antecedent, and lawlike statements are confirmed by their instances. Thus

If Nanook of the north were in this room he would be safe from freezing,

is a true counterfactual because the law

If the temperature is well above freezing then the residents are safe from freezing,

(along with background information) licenses inference of the consequent

Nanook is safe from freezing,

from the antecedent

Nanook is in this room.

On the other hand, no such law supports a counterfactual like

If my only son were in this room he would be a third son.

Similarly, the lawlike statement

Everyone in this room is safe from freezing.

is confirmed by the instance

Nanook is in this room and is safe from freezing,

whereas

Everyone in this room is a third son,

even if true, is not lawlike, since instances do not confirm it. Goodman's formulation of the problem of (enumerative) induction thus focused on the distinction between lawlike and accidental generalizations. Generalizations that are confirmed by their instances Goodman called projectible. In these terms projectibility ties together three different questions: lawlikeness, counterfactuals, and confirmation. Goodman also proposed an account of the distinction between projectible and unprojectible hypotheses. Very roughly put, this is that projectible hypotheses are made up of predicates that have a history of use in projections.

4.2 Karl Popper's views on induction

One of the most influential and controversial views on the problem of induction has been that of Karl Popper, announced and argued in (Popper LSD). Popper held that induction has no place in the logic of science. Science in his view is a deductive process in which scientists formulate hypotheses and theories that they test by deriving particular observable consequences. Theories are not confirmed or verified. They may be falsified and rejected or tentatively accepted if corroborated in the absence of falsification by the proper kinds of tests:

[A] theory of induction is superfluous. It has no function in a logic of science.

The best we can say of a hypothesis is that up to now it has been able to show its worth, and that it has been more successful than other hypotheses although, in principle, it can never be justified, verified, or even shown to be probable. This appraisal of the hypothesis relies solely upon deductive consequences (predictions) which may be drawn from the hypothesis: There is no need even to mention “induction” (Popper LSD, 315).

Popper gave two formulations of the problem of induction; the first is the establishment of the truth of a theory by empirical evidence; the second, slightly weaker, is the justification of a preference for one theory over another as better supported by empirical evidence. Both of these he declared insoluble, on the grounds, roughly put, that scientific theories have infinite scope and no finite evidence can ever adjudicate among them (Popper LSD, 253–254; Grattan-Guinness 2004). He did, however, hold that theories could be falsified, and that falsifiability, or the liability of a theory to counterexample, was a virtue. Falsifiability corresponds roughly to the proportion of models in which a (consistent) theory is false. Highly falsifiable theories thus make stronger assertions and are in general more informative. Though theories cannot in Popper's view be supported, they can be corroborated: a better corroborated theory is one that has been subjected to more and more rigorous tests without having been falsified. Falsifiable and corroborated theories are thus to be preferred, though, as the insolubility of the second problem of induction makes evident, these are not to be confused with support by evidence.

Popper's epistemology is almost exclusively the epistemology of scientific knowledge. This is not because he thinks that there is a sharp division between ordinary knowledge and scientific knowledge, but rather because he thinks that to study the growth of knowledge one must study scientific knowledge:

[M]ost problems connected with the growth of our knowledge must necessarily transcend any study which is confined to common-sense knowledge as opposed to scientific knowledge. For the most important way in which common-sense knowledge grows is, precisely, by turning into scientific knowledge (Popper LSD, 18).

5. Probability and Induction

So far only straightforward non-probabilistic forms of the problem of induction have been surveyed. The addition of probability to the question is not only a generalization; probabilistic induction is much deeper and more complex than induction without probability. The following subsections look at several different approaches: Rudolf Carnap's inductive logic, Hans Reichenbach's frequentist account, Bruno de Finetti's subjective Bayesianism, likelihood methods, and the Neyman-Pearson method of hypothesis testing.

5.1 Carnap's inductive logic

Carnap's classification of inductive inferences (Carnap LFP, ¶44) will be generally useful in discussing probabilistic induction. He lists five sorts:

  1. Direct inference typically infers the relative frequency of a trait in a sample from its relative frequency in the population from which the sample is drawn. The sample is said to be unbiased to the extent that these frequencies are the same. If the incidence of lung disease among all cigarette smokers in the U.S. is 0.15, then it is reasonable to predict that the incidence among smokers in California is close to that figure.
  2. Predictive inference is inference from one sample to another sample not overlapping the first. This, according to Carnap, is “the most important and fundamental kind of inductive inference” (Carnap LFP, 207). It includes the special case, known as singular predictive inference, in which the second sample consists of just one individual. Inferring the color of the next ball to be drawn from an urn on the basis of the frequency of balls of that color in previous draws with replacement illustrates a common sort of predictive inference.
  3. Inference by analogy is inference from the traits of one individual to those of another on the basis of traits that they share. Hume's famous arguments that beasts can reason, love, hate, and be proud or humble (Hume THN, I.III.16, II.I.12, II.II.12) are classic instances of analogy. Disagreements about racial profiling come down to disagreements about the force of certain analogies.
  4. Inverse inference infers something about a population on the basis of premises about a sample from that population. Again, that the sample be unbiased is critical. The use of polls to predict election results, of controlled experiments to predict the efficacy of therapies or medications, are common examples.
  5. Universal inference is inference from a sample to a hypothesis of universal form. Simple enumerative induction, mentioned in the introduction and in section 3, is the standard sort of universal inference. Karl Popper's objections to induction, mentioned in section 4, are for the most part directed against universal inference. Popper and Carnap are less opposed than it might seem in this regard: Popper holds that universal inference is never justified. On Carnap's view it is inessential.

5.1.1 Carnapian confirmation theory

Note: Readers are encouraged to read section 3.2 of the entry interpretations of probability in conjunction with the remainder of this section. See also (Zabell 2007) for a thorough discussion of Carnapian induction.

Carnap initially held that the problem of confirmation was a logical problem; that assertions of degree of confirmation by evidence of a hypothesis should be analytic and depend only upon the logical relations of the hypothesis and evidence.

Carnapian induction concerns always the sentences of a language as characterized in section 3.2 of interpretations of probability. The languages in question here are assumed to be interpreted, i.e. the referents of the non-logical constants are fixed, and identity is interpreted normally. A set of sentences of such a language is consistent if it has a model in which all of its members are true. A set is maximal if it has no consistent proper superset in the language. (So every inconsistent set is maximal.) The language in question is said to be finite if it includes just finitely many maximal consistent subsets. Each maximal consistent (m.c.) set says all that can be said about some possible situation described in the language in question. The m.c. sets are thus a precise way of understanding the notion of case that is critical in the classical conception of probability (interpretations of probability section 3.1).

Much of the content of the theory can be illustrated, as is done in interpretations of probability, in the simple case of a finite language ℒ including just one monadic predicate, S (signifying a successful outcome of a repeated experiment such as draws from an urn), and just finitely many individual constants, a1, …, ar, signifying distinct trials or draws.

There will in this case be 2^r conjunctions S′(a1) ∧ … ∧ S′(ar), where S′(ai) is either S(ai) (success on the ith trial) or its negation ¬S(ai). These are the state descriptions of ℒ. Each maximal consistent set of ℒ will consist of the logical consequences of one of the state descriptions, so there will be 2^r m.c. sets. Thus, pursuing the affinity with the classical conception, the probability of a sentence e is just the ratio

m(e) = n/2^r

where n is the number of state descriptions that imply e (interpretations of probability, section 3.2). c-functions generalize logical implication. In the finite case a sentence e logically implies a sentence h if the collection of m.c. sets each of which includes e is a subset of those that include h. The extent to which e confirms h is just the ratio of the number of m.c. sets including both h and e to the number of those including e. This is the proportion of possible cases in which e is true in which h is also true.
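These definitions can be realized by direct enumeration. The following is a minimal sketch, not Carnap's own notation: state descriptions are represented as tuples of truth values, sentences as Boolean functions of state descriptions, m counts the state descriptions in which a sentence holds, and c(h, e) is the proportion of the e-cases that are also h-cases.

    # A minimal sketch of m and c for a language with one predicate S and r individuals.
    # The representations are illustrative, not Carnap's notation.
    from itertools import product

    r = 3
    state_descriptions = list(product([True, False], repeat=r))  # 2^r state descriptions

    def m(sentence):
        """Proportion of state descriptions in which the sentence holds: n / 2^r."""
        return sum(1 for s in state_descriptions if sentence(s)) / len(state_descriptions)

    def c(h, e):
        """Degree of confirmation of h by e: the proportion of e-cases that are also h-cases."""
        return m(lambda s: h(s) and e(s)) / m(e)

    S1_and_S2 = lambda s: s[0] and s[1]  # success on the first two trials
    S3 = lambda s: s[2]                  # success on the third trial

    print(m(S3))             # 0.5
    print(c(S3, S1_and_S2))  # 0.5: this c is unmoved by the evidence (see below)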

In this simple example, state descriptions are said to be isomorphic when they include the same number of successes. A structure description is a maximal disjunction of isomorphic state descriptions. In the present example, a structure description says how many trials have successful outcomes without saying which trials these are. (See interpretations of probability, section 3.2 for examples.)

Confirmation functions all satisfy two additional qualitative logical constraints: They are regular, which, in the case of a finite language, means that they assign positive value to every state description, and they are also symmetrical. A function on ℒ is symmetrical if it is invariant under permutations of the individual constants of ℒ. That is to say, if the names of objects are switched around, the values of c and m are unaffected. State descriptions that are related in this way are isomorphic. “(W)e require that logic should not discriminate between the individuals but treat them all on a par; although we know that individuals are not alike, they ought to be given equal rights before the tribunal of logic” (Carnap LFP, 485).

Although regularity and symmetry do not determine a unique confirmation function, they nevertheless suffice to derive a number of important results concerning inductive inferences. In particular, in the simple case of a finite language with one predicate, S, these constraints entail that state descriptions in the same structure description (with the same relative frequency of success) must always have the same m value. And if dk and ek are sequences giving outcomes of trials 1, . . . , k (k < r) with the same number of Ss,

c(S(k + 1), dk) = c(S(k + 1), ek)

In the three-constant language of interpretations of probability, c(S3 | S1 ∧ S2) = ½ for all values of S1 and S2, affirmed or denied; c is completely unaffected by the evidence:

c(S3 | S1 ∧ S2) = ½
c(S3 | ¬S1 ∧ ¬S2) = ½
c(S3 | S1 ∧ ¬S2) = ½

This strong independence led Carnap to reject c in favor of c*. This is the function that he endorsed in (Carnap LFP) and that is illustrated in interpretations of probability. c* gives equal weight to each structure description. Symmetry assures that the weight is equally apportioned to state descriptions within a structure description. c* thus weighs uniform state descriptions, those in which one sort of outcome predominates, more heavily than those in which outcomes are more equally apportioned. This effect diminishes as the number of trials or individual constants increases.
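The contrast between the function c just described and c* can be checked by direct computation in the same toy language. The following is a minimal sketch continuing the representation used above (again an illustrative encoding, not Carnap's notation): m* gives each structure description the weight 1/(r + 1) and divides it equally among the state descriptions within it, and the resulting c* does learn from experience.

    # A minimal sketch of m* and c* for the three-constant, one-predicate language.
    from itertools import product
    from math import comb

    r = 3
    state_descriptions = list(product([True, False], repeat=r))

    def m_star_weight(s):
        """Weight of a state description under m*: 1/(r + 1) per structure description,
        shared equally among the state descriptions with the same number of successes."""
        k = sum(s)  # the number of successes fixes the structure description
        return (1 / (r + 1)) / comb(r, k)

    def m_star(sentence):
        return sum(m_star_weight(s) for s in state_descriptions if sentence(s))

    def c_star(h, e):
        return m_star(lambda s: h(s) and e(s)) / m_star(e)

    S1_and_S2 = lambda s: s[0] and s[1]
    S3 = lambda s: s[2]

    print(c_star(S3, S1_and_S2))  # 0.75: past successes now raise the probability of a further success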

(Carnap 1952) generalized the approach of (Carnap LFP) to construct an infinite system of inductive methods. This is the λ system. The fundamental principle of the λ system is that degree of confirmation should give some weight to the purely empirical character of evidence and also some weight to the logical structure of the language in question. (c* does this.) The λ system consists of c-functions that are mixtures of functions that give total weight to these extremes. See the discussion in (interpretations of probability, section 3.2).

Two points, both mentioned in interpretations of probability (section 3.2), should be emphasized: 1. Carnapian confirmation is invariant for logical equivalence within the framing language. Logical equivalence may however outrun epistemic equivalence, particularly in complex languages. The tie of confirmation to knowledge is thus looser than one might hope. 2. Degree of confirmation is relative to a language. Thus the degree of confirmation of a hypothesis by evidence may differ when formulated in different languages.

(Carnap LFP, 569) also includes a first effort at characterizing analogical inference. Analogies are in general stronger when the objects in question share more properties. This rough statement suffers from the lack of a method for counting properties; without further precision about this, it looks as though any two objects must share infinitely many properties. What is needed is some way to compare properties in the right way. Carnap's proposal depends upon characterizing the strongest consistent monadic properties expressible in a language. Given a finite language ℒ including only distinct and logically independent monadic predicates, each conjunctive predicate including for each atomic predicate either it or its negation is a Q-predicate. Q-predicates are the predicative analogue of state descriptions. Any sentence formed by instantiating a Q-predicate with an individual constant throughout is thus a consistent and logically strongest description of that individual. Every monadic property expressed in ℒ is equivalent to a unique disjunction of Q-predicates, and the width of a property is just the number of Q-predicates in this disjunction. The width of properties corresponds to their weakness in an intuitive sense: The widest property is the tautological property; no object can fail to have it. The narrowest (consistent) properties are the Q-properties.

Let

ρbc be the conjunction of all the properties that b and c are known to share;
ρb be the conjunction of all the properties that b is known to have.

So ρb implies ρbc and the analogical inference in question is

b has ρb
b and c both have ρbc
c has ρb

Let w(ρb) and w(ρbc) be the widths of ρb and ρbc respectively. (Since ρb implies ρbc, in the non-trivial case w(ρb) < w(ρbc).)

It follows from the above that

c*(c has ρb, b and c have ρbc) = [w(ρb) + 1] / [w(ρbc) + 1]

Now as the proportion of known properties of b shared by c increases, this quantity also increases, which is as it should be.
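The widths themselves can be computed by enumerating Q-predicates. The following is a minimal sketch applying the formula just stated; the three atomic predicates and the particular choices of ρb and ρbc are invented for illustration (with three logically independent atomic predicates there are 2^3 = 8 Q-predicates; a conjunction of all three has width 1, a conjunction of two of them width 2).

    # A minimal sketch of property widths; the predicates and properties are illustrative.
    from itertools import product

    n_atomic = 3
    # A Q-predicate assigns a truth value to each of the atomic predicates.
    q_predicates = list(product([True, False], repeat=n_atomic))

    def width(property_holds):
        """Number of Q-predicates in the disjunction equivalent to the property."""
        return sum(1 for q in q_predicates if property_holds(q))

    rho_b  = lambda q: q[0] and q[1] and q[2]  # all of b's known properties: width 1
    rho_bc = lambda q: q[0] and q[1]           # the properties b and c are known to share: width 2

    print((width(rho_b) + 1) / (width(rho_bc) + 1))  # 2/3, rising toward 1 as c shares more of b's properties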

Although the theory does provide an account of analogical inference in simple cases, in more complicated cases, in which the analogy depends upon the similarity of different properties, it is, as it stands, insufficient. In later work Carnap and others developed an account of similarity to overcome this. See the critical remarks in (Achinstein 1963) and Carnap's response in the same issue.

5.2 Reichenbach's frequentism

5.2.1 Reichenbach's theory of probability

Section 3.3 of interpretations of probability, as well as section 2.3 of the entry on Reichenbach should be read in conjunction with this section.

Carnap's logical probability generalized the metalinguistic relation of logical implication to a numerical function, c(h, e), that expresses the extent to which an evidence sentence e confirms a hypothesis h. Reichenbach's probability implication is also a generalization of a deductive concept, but the concept generalized belongs first to an object language of events and their properties. (Reichenbach's logical probability, which defines probabilities of sentences, is briefly discussed below.) Russell and Whitehead in (Whitehead 1957, vol I, 139) wrote

ρx ⊃x φx

which they called “formal implication”, to abbreviate

(x)(ρx ⊃ φx)

Reichenbach's generalization of this extends classical first-order logic to include probability implications. These are formulas (Reichenbach TOP, 45)

x ∈ A ⊃p x ∈ B

where p is some quantity between zero and one inclusive. Probability implications may be abbreviated

A ⊃p B

In a more conventional notation this probability implication between properties or classes may be written

P(B | A) = p

(There are a number of differences from Reichenbach's notation in the present exposition. Most notably he writes P(A, B) rather than P(B | A). The latter is written here to maintain consistency with the notations of other sections.) Russell and Whitehead were following Peano (Peano SWP, 193) who, though he lacked fully developed quantifiers, had nevertheless the notions of formal implication and bound and free variables on which the Principia notation depends. In the modern theory free variables are read as universally quantified with widest scope, so the subscripted variable is redundant and the notation has fallen into disuse. (See Vickers 1988 for a general account of probability quantifiers including Reichenbachean conditionals.)

Reichenbach's probability logic is a conservative extension of classical first-order logic to include rules for probability implications. The individual variables (x, y) are taken to range over events (“The gun was fired”, “The shot hit the target”) and, as the notation makes evident, the variables A and B range over classes of events (“the class of firings by an expert marksman”, “the class of hits within a given range of the bullseye”) (Reichenbach TOP, 47). The formal rules of probability logic assure that probability implications conform to the laws of conditional probability and allow inferences integrating probability implications into deductive logic, including higher-order quantifiers over the subscripted variables.

Reichenbach's rules of interpretation of probability implications require, first, that the classes A and B be infinite and in one-one correspondence so that their order is established. It is also required that the limiting relative frequency

limn→∞ N(An ∩ Bn) / n

where An, Bn are the first n members of A, B respectively, and N gives the cardinality of its argument, exists. When this limit does exist it defines the probability of B given A (Reichenbach 1971, 68):

P(B | A) =def limn→∞ N(An ∩ Bn) / n when the limit exists.
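Because only finite initial segments are ever observed, the definition can be brought down to earth with a small sketch. The chance mechanism and the numbers below are invented; only the quotient N(An ∩ Bn) / n is Reichenbach's.

    import random

    random.seed(0)

    # A toy ordered sequence of 10,000 events, all in class A; membership in B is
    # decided by a stable chance mechanism (probability 0.3) standing in for the
    # objective process whose limiting frequency the definition targets.
    N_EVENTS = 10_000
    in_B = [random.random() < 0.3 for _ in range(N_EVENTS)]

    def relative_frequency(n):
        """N(An ∩ Bn) / n for the first n members of the sequence."""
        return sum(in_B[:n]) / n

    for n in (10, 100, 1000, N_EVENTS):
        print(n, relative_frequency(n))
    # The frequencies settle near 0.3, the value the definition would identify
    # with P(B | A) if the limit exists; a finite segment can of course suggest,
    # but never establish, that it does.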

The complete system also includes higher-order or, as Reichenbach calls them, concatenated probabilities. First-level probabilities involve infinite sequences: the ordered sets referred to by the predicates of probability implications. Second-order probabilities are determined by lattices, or sequences of sequences. Here is a simplified sketch of this (Reichenbach 1971, chapter 8; Reichenbach 1971, ¶41).

b11 b12 … b1j …    limn→∞[N(B1n ∩ C) / n] = p1
b21 b22 … b2j …    limn→∞[N(B2n ∩ C) / n] = p2
…
bi1 bi2 … bij …    limn→∞[N(Bin ∩ C) / n] = pi
…

All the bij are members of B, some of which are also members of C. Each row i gives a sequence of members of B:

{bi} = {bi1, bi2, … }

Where Bin is the sequence

Bin = {bi1, bi2, …, bin}

of the first n members of the sequence {bi}, we assume that the limit, as n increases without bound, of the proportion of these that are also members of C,

limn→∞[N(Bin ∩ C) / n]

exists for each row. Hence each row determines a probability, pi :

Pi(C | B) = limn→∞[N(Bin ∩ C) / n] = pi

Now let {ai} be a sequence of members of the set A and consider the sequence of pairs

{<a1, p1>, <a2, p2>, …, <ai, pi>, … }

Let p be some quantity between zero and one inclusive. For given m the proportion of pi in the first m members of this sequence that are equal to p is

[Ni ≤ m(pi = p) / m]

Suppose that the limit of this quantity as m increases without bound exists and is equal to q:

limm→∞[Ni ≤ m(pi = p) / m] = q

We may then identify q as the second order probability given A that the probability of C given B is p:

P{[P(C | B) = p] | A} = q

The method permits higher order probabilities of any finite degree corresponding to matrices of higher dimensions. It is noteworthy that Reichenbach's theory thus includes a logic of expectations of probabilities and other random variables.
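A small simulation may make the two-step construction concrete. Everything in it (the lattice, the row-level chances, the tolerance) is invented for the illustration; the procedure of taking row limits pi and then the limiting frequency with which pi takes a given value follows the text.

    import random

    random.seed(1)

    # A toy lattice: each of 1,000 rows is a sequence of events; whether an event
    # falls in C is governed by a row-level chance drawn once per row (0.2 or 0.8
    # with equal probability).
    ROWS, COLS = 1000, 2000
    row_chance = [random.choice([0.2, 0.8]) for _ in range(ROWS)]

    def row_frequency(i):
        """Estimate p_i = lim_n N(B_in ∩ C) / n from COLS events in row i."""
        return sum(random.random() < row_chance[i] for _ in range(COLS)) / COLS

    p = [row_frequency(i) for i in range(ROWS)]

    # Second-level probability: the frequency with which p_i is (near) 0.8.
    q = sum(abs(pi - 0.8) < 0.05 for pi in p) / ROWS
    print(q)   # close to 1/2: in this toy model P{[P(C | B) = 0.8] | A} = 1/2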

Before turning to Reichenbach's account of induction, there are three questions about the interpretation of probability to consider. These are

1. The problem of extensionality. The values of the variables in Reichenbach's theory are events and ordered classes of events. The theory is in these respects extensional; probabilities do not depend on how the classes and events of their arguments are described or intended:

If A = A′ and B = B′ then P(x ∈ B | x ∈ A) = P(x ∈ B′ | x ∈ A′)
If x = x′ and y = y′ then P(x ∈ A | y ∈ B) = P(x′ ∈ A | y′ ∈ B)

But probability attributions are intensional: they vary with differences in the ways classes and events are described. The class of examined green things is also the class of examined grue things, but the role of these predicates in probabilistic inference should be different. Less exotic examples are easy to come by. Here is an inference that depends upon extensionality:

The next toss = the next head ⇒
P(x is a head | x = the next toss) = P(x is a head | x = the next head) = 1
The next toss = the next tail ⇒
P(x is a head | x = the next toss) = P(x is a head | x = the next tail) = 0

Since (The next toss = the next head) or (The next toss = the next tail),

P(x is a head | x = the next toss) = 1 or P(x is a head | x = the next toss) = 0

To block this inference one would have to block the replacement of “the next toss” by “the next head” and by “the next tail” within the scope of the probability operator, but the extensionality of that operator permits just these replacements. Reichenbach seems not to have appreciated this difficulty.

2. The problem of infinite sequences. This is the problem of the application of the definition of probability, which presumes infinite sequences for which limits exist, to actual cases. In the world of our experience sequences of events are finite. This looks to entail that there can be no true statements of the form P(B | A) = p.

The problem of infinite sequences is a consequence of a quite general problem about reference to infinite totalities; such totalities cannot be given in extension and require always some intensional way of being specified. This leaves the extensionality of probability untouched, however, since there is no privileged intension; the above argument continues to hold. Reichenbach distinguishes two ways in which classes can be specified: extensionally, by listing or pointing out their members, and intensionally, by giving a property of which the class is the extension. Classes specified intensionally may be infinite. Some classes may be necessarily finite; the class of organisms, for example, is limited in size by the quantity of matter in the universe; but in some of these cases the class may be theoretically, or in principle, infinite. Such a class may be treated as if it were infinite for the purposes of probabilistic inference. Although our experience is limited to finite subsets of these classes, we can still consider theoretically infinite extensions of them.

3. The problem of single case probabilities. Probabilities are commonly attributed to single events without reference to sequences or conditions: the probability of rain tomorrow and the probability that Julius Caesar was in Britain in 55 BCE seem not to involve classes.

From a frequentist point of view, single case probabilities are of two sorts. In the first sort the reference class is implicit. Thus, when we speak of the probability of rain tomorrow, we take the suppressed reference class to be days following periods that are meteorologically similar to the present period. These are then treated as standard frequentist probabilities. Single case probabilities of this sort are hence ambiguous; for shifts in the reference class will give different single case probabilities. This ambiguity, sometimes referred to as the problem of the reference class, is ubiquitous; different classes A will give different values for P(B | A). This is not so much a shortcoming as it is a fact of inductive life and probabilistic inductive inference. Reichenbach's principle governing the matter is that one should always use the smallest reference class for which reliable statistics are known. This principle has the same force as the Carnapian requirement of total evidence.

In other cases (the presence of Julius Caesar in Britain is an example) there seems to be no such reference class. To handle such cases Reichenbach introduces logical probabilities defined for collections of propositions or sentences. The notion of truth-value is generalized to allow a continuum of weights, from zero to one inclusive. These weights conform to the laws of probability, and in some cases may be calculated with respect to sequences of propositions. The probability statement will then be of the form

P(x ∈ B | x ∈ A) = p

where A is a reference class of propositions (those asserted by Caesar in The Gallic Wars, for example) and B is the subclass of these that are true.

This account of single-case probabilities obviously depends essentially upon testimony, not to amplify and expand the reach of induction, but to make induction possible.

Reichenbach's account of single-case probabilities contrasts with subjectivistic and logical views, both of which allow the attribution of probabilities to arbitrary propositions or sentences without reference to classes. In the Carnapian case, given a c-function the probability of every sentence in the language is fixed. In subjectivistic theories the probability is restricted only by coherence and the probabilities of other sentences.

5.2.2 Reichenbachian induction

On Reichenbach's view, the problem of induction is just the problem of ascertaining probability on the basis of evidence (Reichenbach TOP, 429). The conclusions of inductions are not asserted; they are posited. “A posit is a statement with which we deal as true, though the truth value is unknown” (Reichenbach TOP, 373).

Reichenbach divides inductions into several sorts, not quite parallel to the Carnapian taxonomy given earlier. These are:

Induction by enumeration, in which an observed initial frequency is posited to hold for the limit of the sequence;

Explanatory inference, in which a theory or hypothesis is inferred from observations;

Cross induction, in which distinct but similar inductions are compared and, perhaps, corrected;

Concatenation or hierarchical assignment of probabilities.

These all resolve to the first—induction by enumeration—in ways to be discussed below. The problem of induction (by enumeration) is resolved by the inductive rule, also known as the straight rule:

If the relative frequency N(An ∩ Bn) / n of B in A is known for the first n members of the sequence A, and nothing is known about the sequence beyond n, then we posit that the limit limn→∞[N(An ∩ Bn) / n] lies within a small increment δ of N(An ∩ Bn) / n.

(This corresponds to the Carnapian λ-function c0 (λ(κ) = 0) which gives total weight to the empirical factor and no weight to the logical factor. See interpretations of probability, 3.2.)
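Stated as a procedure the rule is almost trivially simple; the sketch below, with invented observations, does no more than return the observed relative frequency as the posited value of the limit.

    # A minimal rendering of the straight rule. The observations are invented;
    # 1 records that the event was a B, 0 that it was not.
    observed = [1, 0, 1, 1, 0, 1, 0, 1, 1, 1]

    def straight_rule_posit(observations):
        """Posit the observed relative frequency as the limit of N(An ∩ Bn)/n."""
        return sum(observations) / len(observations)

    print(straight_rule_posit(observed))   # 0.7 is posited (within δ) as the limit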

We saw above how concatenation works. It is a sort of induction by enumeration that amounts to reiterated applications of the inductive rule. Cross induction is a variety of concatenation. It amounts to evaluating an induction by enumeration by comparing it with similar past inductions of known character. Reichenbach cites the famous example of inferring that all swans are white from many instances. A cross induction will list other inductions on the invariability of color among animals and show them to be unreliable. This cross induction will reveal the unreliability of the inference even in the absence of counterinstances (black swans found in Australia). So concatenation, or hierarchical induction, and cross induction are instances of induction by enumeration.

Explanatory inference is not obviously a sort of induction by enumeration. Reichenbach's version (Reichenbach TOP, ¶85) is ingenious and too complex for summary here. It depends upon concatenation and the approximation of universal statements by conditional probabilities close to 1.

Reichenbach's justification of induction by enumeration is known as a pragmatic justification. (See also Salmon 1967, 52–54.) It is first important to keep in mind that the conclusion of inductive inference is not an assertion; it is a posit. Reichenbach does not argue that induction is a sound method; his account is rather what Salmon (Salmon 1963) and others have referred to as vindication: if any rule will lead to positing the correct probability, the inductive rule will, and it is, furthermore, the simplest rule that succeeds in this sense.

What is now the standard difficulty with Reichenbach's rule of induction was noticed by Reichenbach himself and later strengthened by Wesley Salmon (Salmon 1963). It is that for any observed relative frequency in an initial segment of any finite length, and for any arbitrarily selected quantity between zero and one inclusive, there exists a rule that leads to that quantity as the limit on the basis of that observed frequency. Salmon goes on to announce additional conditions on adequate rules that uniquely determine the rule of induction. More recently Cory Juhl (Juhl, 1994) has examined the rule with respect to the speed with which it approaches a limit.

5.3 Subjectivism and Bayesian induction: de Finetti

Section 3 of the article Bayes' theorem should be read in conjunction with this section.

5.3.1 Subjectivism

Bruno de Finetti (1906–1985) is the founder of modern subjectivism in probability and induction. He was a mathematician by training and inclination, and he typically writes in a sophisticated mathematical idiom that can discourage the mathematically naïve reader. In fact, the deep and general principles of de Finetti's theory, and in particular the structure of the powerful representation theorem, can be expressed in largely non-technical language with the aid of a few simple arithmetical principles. De Finetti himself insists that “questions of principle relating to the significance and value of probability [should] cease to be isolated in a particular branch of mathematics and take on the importance of fundamental epistemological problems,” (de Finetti FLL, 99) and he begins the first chapter of the monumental “Foresight” by inviting the reader to “consider the notion of probability as it is conceived by us in everyday life” (de Finetti FLL, 100).

Subjectivism in probability identifies probability with strength of belief. Hume was in this respect a subjectivist: He held that strength of belief in a proposition was the proportion of assertive force that the mind devoted to the proposition. He illustrates this with the famous example of a six-sided die (Hume THN, 127–130), four faces of which bear one mark and the other two faces of which bear another mark. If we see the die in the air, he says, we can't avoid anticipating that it will land with some face upwards, nor can we anticipate any one particular face rather than another landing up. In consequence the mind divides its force of anticipation equally among the faces and combines the force directed to faces bearing the same mark. This is what constitutes a belief of strength 2/3 that the die will land with one mark up, and 1/3 that it will land with the other mark up.

There are three evident difficulties with this account. First is the unsatisfactory identification of belief with mental force, whether divided or not. It is, outside of simple cases like the symmetrical die, not at all evident that strength of feeling is correlated with strength of belief; some of our strongest beliefs are, as Ramsey says (Ramsey 1931, 169), accompanied by little or no feeling. Second, even if it is assumed that strength of feeling entails strength of belief, it is a mystery why these strengths should be additive as Hume's example requires. Finally, the principle according to which belief is apportioned equally among exclusive and exhaustive alternatives is not easy to justify. This is known as the principle of indifference, and it leads to paradox if unrestricted. (See interpretations of probability, section 3.1.) The same situation may be partitioned into alternative outcomes in different ways, leading to distinct partial beliefs. Thus if a coin is to be tossed twice we may partition the outcomes as

2 Heads, 2 Tails, (Heads on 1 and Tails on 2), (Tails on 1 and Heads on 2)

which, applying the principle of indifference, yields P(2 Heads) = 1/4

or as

Zero Heads, One Head, Two Heads

which yields P(2 Heads) = 1/3.

Carnap's c-functions c* and c†, mentioned in section 5.1 above, provide a more substantial example: c† counts the state descriptions as alternative outcomes and c* counts the structure descriptions as outcomes. They assign different probabilities. Indeed, the continuum of inductive methods can be seen as a continuum of different applications of the principle of indifference.

These difficulties with Hume's mentalistic view of strength of belief have led subjectivists to associate strength of belief not with feelings but with actions, in accordance with the pragmatic principle that the strength of a belief corresponds to the extent to which we are prepared to act upon it. Bruno de Finetti announced that “PROBABILITY DOES NOT EXIST!” in the beginning paragraphs of his Theory of Probability (de Finetti TOP). By this he meant to deny the existence of objective probability and to insist that probability be understood as a set of constraints on partial belief. In particular, strength of belief is taken to be expressed in betting odds: If you will put up p dollars (where, for example, p = 0.25) to receive one dollar if the event A occurs and nothing (forfeiting the p dollars) if A does not occur, then your strength of belief in A is p. If £ is a language like that sketched above, the sentences of which express events, then a belief system is given by a function b that gives betting odds for every sentence in £. Such a system is said to be coherent if there is no set of bets in accordance with it on which the believer must lose. It can be shown (this is the “Dutch Book Theorem”) that all and only coherent belief systems satisfy the laws of probability. (See interpretations of probability, section 3.5.2, and section 3 of the entry on Bayesian epistemology as well as the supplement to the latter on Dutch Book arguments for comprehensive discussions.) The Dutch Book Theorem provides a subjectivistic response to the question of what probability has to do with partial belief; namely that the laws of probability are minimal laws of calculative rationality. If your partial beliefs don't conform to them then there is a set of bets all of which you will accept and on which your gain is negative in every possible world.
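The force of the coherence requirement can be seen in a toy case. The particular incoherent odds below are invented; the point illustrated is only the direction of the Dutch Book Theorem that runs from incoherence to sure loss.

    # An agent posts betting odds 0.7 on an event A and 0.6 on its negation;
    # since 0.7 + 0.6 > 1 these beliefs violate the laws of probability.
    # A bookie sells the agent both bets, each paying 1 if the relevant event occurs.
    b_A, b_notA = 0.7, 0.6

    def net_gain(A_occurs):
        """Agent's net gain from staking b_A on A and b_notA on not-A."""
        winnings = (1 if A_occurs else 0) + (0 if A_occurs else 1)
        return winnings - (b_A + b_notA)   # exactly one bet pays in either case

    print(net_gain(True), net_gain(False))   # -0.3 in both possible worlds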

As just stated the Dutch Book Theorem is unsatisfactory: It has been clear at least since Jacob Bernoulli's Ars Conjectandi of 1713 that the odds at which a reasonable person will bet vary with the size of the stake: A thaler is worth more to a pauper than to a rich man, as Bernoulli put it. This means that in fact betting systems are not determined by monetary odds. Subjectivists have in consequence taken strength of belief to be given by betting odds when the stakes are measured not in money but in utility. (See interpretations of probability, section 3.5.3.) Frank Ramsey was the first to do this in (Ramsey 1926, 156–198). Leonard J. Savage provided a more sophisticated axiomatization of choice in the face of uncertainty (Savage 1954). These and later accounts, such as that of Richard Jeffrey (Jeffrey LOD), still face critical difficulties, but the general principle that associates coherent strength of belief with probability remains a fundamental postulate of subjectivism.

5.3.2 Bayesian induction

Of the five sorts of induction mentioned above (section 5.1), de Finetti is concerned explicitly only with predictive inference, though his account applies as well to direct and inverse inference. He ignores analogy, and he holds that no particular premises can support a general hypothesis. The central question of induction is, he says, “if a prediction of frequency can be, in a certain sense, confirmed or refuted by experience. … [O]ur explanation of inductive reasoning is nothing else, at bottom than the knowledge of … the probability of En + 1 evaluated when the result A of [trials] E1, …, En is known” (de Finetti 1964, 119). That is to say that for de Finetti, the singular predictive inference is the essential inductive inference.

One conspicuous sort of inverse inference concerns relative frequencies. Suppose, for example, that from an urn containing balls, each of which is red or black, we are to draw three balls (with replacement). What should our beliefs be before drawing any balls? The classical description of this situation is that the draws are independent with unknown constant probability, p, of drawing a red ball. (Such probabilities are known as Bernoullian probabilities, recalling that Jacob Bernoulli based the law of large numbers on them.) Since the draws are independent, the probability of drawing a red on the second draw given a red on the first draw is

P(R2 | R1) = P(R2) = p

where p is an unknown probability. Notice that Bernoullian probabilities are invariant for variations in the order of draws: If A(n, k) and B(n, k) are two sequences of length n each including just k reds, then

b[A(n, k)] = b[B(n, k)] = pk(1 − p)(n−k)

De Finetti, and subjectivists in general, find this classical account unsatisfactory for several reasons. First, the reference to an unknown probability is, from a subjectivistic point of view, unintelligible. If probabilities are partial beliefs, then ignorance of the probability would be ignorance of my own beliefs. Secondly, it is a confusion to suppose that my beliefs change when a red ball is drawn. Induction from de Finetti's point of view is not a process for changing beliefs; it proceeds rather by reducing the uncertainty already present in prior beliefs about the process in question.

[T]he probability of En+1 evaluated when one comes to know the result A of [trials] E1, …, En is not an element of an essentially novel nature (justifying the introduction of a new term, like “statistical” or “a posteriori” probability.) This probability is not independent of the “a priori probability” and does not replace it; it flows in fact from the same a priori judgment, by subtracting, so to speak, the components of doubt associated with the trials whose results have been obtained (de Finetti FLL, 119, 120).

In the important case of believing the probability of an event to be close to the observed relative frequency of events of the same sort, we learn that certain initial frequencies are ruled out. It is thus critical to understand the nature of initial uncertainty and initial dispositional beliefs, i.e., initial dispositions to wager.

De Finetti approaches the problem of inverse inference by emphasizing a fundamental feature of our beliefs about random processes like draws from an urn. This is that, as in the Bernoullian case, our beliefs are invariant for sequences of the same length with the same relative frequency of success. For each n and each k ≤ n our belief that there will be k reds in n trials is the same regardless of the order in which the reds and blacks occur. Probabilities (partial beliefs) of this sort are exchangeable.[1] If b(n, k) is our prior belief that n trials will yield k reds in some order or other then, since there are

C(n, k) = n! / [k!(n − k)!]

distinct sequences of length n with k reds, the mean or average probability of any one such sequence is given by the prior belief divided by this quantity:

b(n, k) / C(n, k)

and in the exchangeable case, in which sequences of the same length and frequency of reds are equiprobable, this is the probability of each sequence of this sort. Hence, where b gives prior belief and A(n, k) is any given sequence including k reds and n − k blacks:

b[A(n, k)] = b(n, k) / C(n, k)

In an important class of subcases we might have specific knowledge about the constitution of the urn that can lead to further refinement of exchangeable beliefs. If, for example, we know that there are just three balls in the urn, each either red or black, then there are four exclusive hypotheses incorporating this information:

H0: zero reds, three blacks
H1: one red, two blacks
H2: two reds, one black
H3: three reds, zero blacks

Let the probabilities of these hypotheses be h0, h1, h2, h3, respectively. Of course in the present example

b(Rj | H0) = 0
b(Rj | H3) = 1

for each j. Now if A(n, k) is any individual sequence of k reds and n − k blacks, then, since the Hi are exclusive and exhaustive hypotheses,

b[A(n, k)] = ∑i b[A(n, k) ∧ Hi] = ∑i b[A(n, k) | Hi]hi

In the present example each of the conditional probabilities b[ · | Hi] represents draws from an urn of known composition. These are just Bernoullian probabilities with probability of success (red):

b(Rj | H0) = 0
b(Rj | H1) = 1/3
b(Rj | H2) = 2/3
b(Rj | H3) = 1

b (and this is true of exchangeable probabilities in general) is thus conditionally Bernoullian. If we write

pi(X) = b[X | Hi]

then for each sequence A(n, k) including k reds in n draws,

pi[A(n, k)] = pi(Rj)k[1 − pi(Rj)](n−k)

and we see that b is a mixture, or weighted average, of Bernoullian probabilities in which the weights, summing to one, are the hi:

b(X) = ∑i pi(X)hi

5.3.3 The de Finetti Representation Theorem (finite case)

This is a special case of de Finetti's representation theorem. The general statement of the finite form of the theorem is:

If b is any exchangeable probability on finite sequences of a random phenomenon then b is a finite mixture of Bernoullian probabilities on those sequences.

It is easy to see that exchangeable probabilities are closed under finite mixtures: Let b and c be exchangeable, m and n positive quantities summing to one, and let

f = mb + nc

be the mixture of b and c with weights m and n. Then if A and B are sequences of the same length each of which includes just k reds:

mb(A) = mb(B), nc(A) = nc(B)
mb(A) + nc(A) = mb(B) + nc(B)
f(A) = f(B)

Hence since, as mentioned above, all Bernoullian probabilities are exchangeable, every finite mixture of Bernoullian probabilities is exchangeable.

To see how the representation theorem works in induction, let us take the Hi to be equiprobable, so hi = 1/4 for each i. (We'll see that this assumption diminishes in importance as we continue to draw and replace balls.) Then for each j,

b(Rj) = (1/4)[(0) + (1/3) + (2/3) + 1] = 1/2

and

b(R2 | R1) = (1/4)[∑i pi(R1 ∧ R2)] / (1/4)[∑i pi(R1)]
= [0 + (1/9) + (4/9) + 1] / [0 + (1/3) + (2/3) + 1]
= (14/9) / 2
= 7/9

thus updating by taking account of the evidence R1. In this way exchangeable probabilities take account of evidence, by, in de Finetti's phrase, “subtracting, so to speak, the components of doubt associated with the trials whose results have been obtained”.

Notice that R1 and R2 are not independent in b:

b(R2) = 1/2 ≠ b(R2 | R1) = 7/9

so b is not Bernoullian. Hence, though all mixtures of Bernoullian probabilities are exchangeable, the converse does not hold: Bernoullian probabilities are not closed under mixtures, for b is the mixture of the Bernoullian probabilities pi but is not itself Bernoullian. This reveals the power of the concept of exchangeability: The closure of Bernoullian probabilities under mixtures is just the totality of exchangeable probabilities.

We can also update beliefs about the hypotheses Hi. By Bayes' law (See the article Bayes' Theorem and section 5.4.1 on likelihoods below) for each j:

b(Hj | R1) = b(R1 | Hj)hj / ∑i b(R1 | Hi)hi

so

b(H0 | R1) = 0
b(H1 | R1) = (1/3)(1/4) / [(1/3)(1/4) + (2/3)(1/4) + (1)(1/4)]
= (1/12) / [(1/12) + (2/12) + (3/12)]
= (1/12) / (1/2) = 1/6
b(H2 | R1) = (2/3)(1/4) / (1/2) = (2/12) / (1/2) = 1/3
b(H3 | R1) = (1)(1/4) / (1/2) = 1/2

Thus the initial assumption of the flat or “indifference” measure for the hi loses its influence as evidence grows.
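All of the fractions above can be checked mechanically. The sketch below uses only the numbers given in the example (flat prior 1/4; Bernoullian chances 0, 1/3, 2/3, 1); the fractions module merely keeps the arithmetic exact.

    from fractions import Fraction

    priors = [Fraction(1, 4)] * 4                                   # h0 .. h3
    p_red  = [Fraction(0), Fraction(1, 3), Fraction(2, 3), Fraction(1)]

    # b(R1): the mixture of the Bernoullian probabilities of red.
    b_R1 = sum(h * p for h, p in zip(priors, p_red))
    print(b_R1)                                  # 1/2

    # b(R2 | R1): mixture probability of two reds over probability of one red.
    b_R1R2 = sum(h * p * p for h, p in zip(priors, p_red))
    print(b_R1R2 / b_R1)                         # 7/9

    # Posterior probabilities of the hypotheses after one red (Bayes' law).
    posteriors = [h * p / b_R1 for h, p in zip(priors, p_red)]
    print([str(q) for q in posteriors])          # ['0', '1/6', '1/3', '1/2']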

We can see de Finettian induction at work by representing the three-ball problem in a tetrahedron:

[Figure: a tetrahedron representing the three-ball experiment]

Each point in this solid represents an exchangeable measure on the sequence of three draws. The vertices mark the pure Bernoullian probabilities, in which full weight is given to one or another hypothesis Hi. The indifference measure that assigns equal probability 1/4 to each hypothesis is the center of mass of the tetrahedron. As we draw successively (with replacement) from the urn, updating as above, exchangeable beliefs, given by the conditional probabilities

b[R(n + 1) | A(n, k)]

move within the solid. Drawing a red on the first draw puts beliefs before the second draw in the plane bounded by H1, H2, and H3 at a point corresponding to the weights h1 = 1/6, h2 = 1/3 and h3 = 1/2. If a black is drawn on the second draw then, conditioning on the evidence (R1 ∧ B2)

b(H0 | R1 ∧ B2) = 0

By Bayes' theorem,

b(H1 | R1 ∧ B2) = b(R1 ∧ B2 | H1)h1 / [b(R1 ∧ B2 | H1)h1 + b(R1 ∧ B2 | H2)h2]

b(H2 | R1 ∧ B2) = b(R1 ∧ B2 | H2)h2 / [b(R1 ∧ B2 | H1)h1 + b(R1 ∧ B2 | H2)h2]

Now b( · | H1) and b( · | H2) are Bernoullian, so

b(R1 ∧ B2 | H1) = b(B2 | H1)b(R1 | H1)

and

b(R1 ∧ B2 | H2) = b(B2 | H2)b(R1 | H2)

Since h1 = h2

b(H1 | R1 ∧ B2) = (2/3)(1/3) / [(2/3)(1/3) + (1/3)(2/3)]
= 1/2
b(H2 | R1 ∧ B2) = 1/2

Beliefs are now at the midpoint of the line connecting H1 and H2. Continued draws will move conditional beliefs along this line. Suppose now that we continue to draw with replacement, and that A(n,k), with increasing n, is the sequence of draws. Maintaining exchangeability and updating assures that as the number n of draws increases without bound, conditional beliefs

b[R(n + 1) | A(n, k)]

are practically certain to converge to one of the Bernoullian measures

b(R | Hi)

The Bayesian method thus provides a solution to the problem of induction as de Finetti formulated it.

5.3.4 Exchangeability

We gave a definition of exchangeability: Every sequence of the same length with the same frequency of reds has the same probability. In fact, for given k and n, this probability is always equal to the probability of k reds followed by n − k blacks,

b(R1, …, Rk, Bk+1, …, Bn) = b(n, k) / C(n, k)

(where b(n, k) = the probability of k reds in n trials, in some order or other) for, in the exchangeable case, probability is invariant for permutations of trials. There are alternative definitions: First, it follows from the first definition that

b(R1, …, Rn) = b(n, n)

and this condition is also sufficient for exchangeability. Finally, if the concept of exchangeability is extended to random variables we have that a sequence {xi} of random variables is exchangeable if for each n the mean μ(x1, …, xn) is the same for every x1, …, xn. See the Supplement on Basic Probability.

The above urn example consists of an objective system—an urn containing balls—that is known. Draws from such an urn are random because the precise prediction of the outcomes is very difficult, if not impossible, due to small perturbing causes (the irregular agitation of the balls) not under our control. But in the three-ball example, because there are just four possible contents, described in the four hypotheses, the perturbations don't affect the fact that there are just eight possible outcomes. As the number of balls increases we add hypotheses, but the basic structure remains; our beliefs continue to be exchangeable and the de Finetti representation theorem assures that the probability of drawing k reds in n trials is always expressed in a formula

b(n, k) = C(n, k) ∑i hi pi(R)k[1 − pi(R)](n−k)

where the hi give the probabilities of the hypotheses Hi. In the simple urn example, this representation has the very nice property that its components match up with features of the objective urn system: Each value of pi corresponds to a constitution of the urn in which the proportion of red balls is pi, and each hi is the probability of that constitution as described in the hypothesis Hi. Epistemically, the pi are, as we saw above, conditional probabilities:

pi(X) = b(X | Hi)

that express belief in X given the hypothesis Hi about the constitution.

The critical role of the objective situation in applications of exchangeability becomes clear when we reflect that, as Persi Diaconis puts it, to make use of exchangeability one must believe in it. We must believe in a foundation of stable causes (solidity, number, colors of the balls; gravity) as well as in a network of variable and accidental causes (agitation of the balls, variability in the way they are grasped). There are, in Hume's phrase, “a mixture of causes among the chances, and a conjunction of necessity in some particulars, with a total indifference in others” (Hume THN, 125f.). It is this entire objective system that supports exchangeability. The fundamental causes must be stable and constant from trial to trial. The variable and accidental causes should operate independently from trial to trial. To underscore this Diaconis gives the example of a basketball player practicing shooting baskets. Since his aim improves with continued practice, the frequency of success will increase and the trials will not be exchangeable; the fundamental causes are not stable. Indeed, de Finetti himself warns that “In general different probabilities will be assigned, depending on the order; whether it is supposed that one toss has an influence on the one which follows it immediately, or whether the exterior circumstances are supposed to vary” (de Finetti FLL, 121).

We count on the support of objective mechanisms even when we cannot formulate even vague hypotheses about the stable causes that constitute it. De Finetti gives the example of a bent coin, deformed in such a way that before experimenting with it we have no idea of its tendency to fall heads. In this case our prior beliefs are plausibly represented by a “flat” distribution that gives equal weight to each hypothesis, to each quantity in the [0, 1] interval. The de Finetti theorem says that in this case the probability of k heads in n tosses is

b(n, k) = C(n, k) ∫01 pk(1 − p)(n−k) f(p) dp

where f(p) gives the weights of the different Bernoullian probabilities (hypotheses) p. We may remain ignorant about the stable causes (the shape and distribution of the mass of the coin, primarily) even after de Finetti's method applied to continued experiments supports conditional beliefs about the strength of the coin's tendency to fall heads. We may insist that each Bernoullian probability, each value for p, corresponds to a physical configuration of the coin, but, in sharp contrast to the urn example, we can say little or nothing about the causes on which exchangeability depends. We believe in exchangeability because we believe that whatever those causes are they remain stable through the trials while the variable causes (such as the force of the throw) do not.
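A numerical check of the bent-coin case may be useful. The number of tosses and of heads below are invented; the flat weight f(p) = 1 and the displayed integral come from the text, and the predictive value (k + 1) / (n + 2) is the familiar rule of succession that this choice of prior yields.

    from math import comb

    def integral(f, steps=200_000):
        """Midpoint Riemann-sum approximation of the integral of f over [0, 1]."""
        dp = 1 / steps
        return sum(f((i + 0.5) * dp) for i in range(steps)) * dp

    n, k = 10, 7          # invented data: 7 heads in 10 tosses
    b_nk = comb(n, k) * integral(lambda p: p**k * (1 - p)**(n - k))
    print(b_nk)           # about 1/11: with a flat prior every number of heads
                          # in ten tosses is equally probable

    # Probability of heads on the next toss, conditional on the 7 heads observed.
    next_head = (integral(lambda p: p**(k + 1) * (1 - p)**(n - k))
                 / integral(lambda p: p**k * (1 - p)**(n - k)))
    print(next_head)      # about (k + 1) / (n + 2) = 8/12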

5.3.5 Meta-inductions

Suppose that you are drawing with replacement from an urn containing a thousand balls, each either red or black, and that you modify beliefs according to the de Finetti formula

b[R(n + 1) | A(n, k)] = ∑i b(R | Hi) hi

where the hi give the probabilities of the updated 1001 hypotheses about the constitution of the urn. Suppose, however, that unbeknownst to you each time a red ball is drawn and replaced a black ball is withdrawn and replaced with a red ball. (This is a variation of the Polya urn in which each red ball drawn is replaced and a second red ball added.)

Without going into the detailed calculation it is evident that your exchangeable beliefs are in this example not supported. To use exchangeability one must believe in it, and to use it correctly, one might add, that belief must be true; de Finettian induction requires a prior assumption of exchangeability.
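A simulation makes the failure vivid. The initial constitution of the urn and the number of draws below are invented; the secret replacement rule is the one described in the text.

    import random

    random.seed(2)

    # A thousand balls, draws with replacement; each time a red is drawn a black
    # ball is secretly withdrawn and replaced with a red one.
    reds, total = 100, 1000
    draws = []
    for _ in range(5000):
        red_drawn = random.random() < reds / total
        draws.append(red_drawn)
        if red_drawn and reds < total:
            reds += 1

    # The frequency of reds drifts upward, so sequences with the same number of
    # reds in different orders are not equally probable: exchangeability fails.
    first_half, second_half = draws[:2500], draws[2500:]
    print(sum(first_half) / 2500, sum(second_half) / 2500)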

Obviously no sequence of reds and blacks could provide evidence for the hypothesis of exchangeability without calling it into question; exchangeability entails that any sequence in which the frequency of reds increases with time has the same probability as any of its permutations. The assumption is however contingent and ampliative and should be subject to inductive support. It is worth recalling Kant's thesis, that regularity of succession in time is the schema, the empirical manifestation, of causal connection. From this point of view, exchangeability is a precise contrary of causality, for its “schema”, its manifestation, is just the absence of regularity of succession, but with constant relative frequency of success. The hypothesis of exchangeability is just that the division of labor between the stable and the variable causes is properly enforced; that the weaker force of variable causes acting in the stable setting of fundamental causes varies order without varying frequency. In the case of gambling devices and similar mechanisms we can provide evidence that the fundamental and determining causes are stable: We can measure and weigh the balls, make sure that none are added or removed between trials, drop the dice in a glass of water, examine the mechanism of the roulette wheel. In less restricted cases—aircraft and automobile accidents, tables of mortality, consumer behavior—the evidence is much more obscure and precarious.

5.3.6 Uncertain Evidence and Jeffrey's Probability Kinematics

Bayesian induction has traditionally taken inductive inference to consist in updating prior beliefs, beliefs before taking account of evidence, on the basis of that evidence. A simple rule of this sort is:

Updating:
If E is the evidence observed between t0 and t1, then for each proposition X, b1(X) = b0 (X | E)

De Finetti, as we've seen, resists distinguishing prior from posterior beliefs. Updating for him amounts just to using b(X | E) as belief in X after observation of E removes “components of doubt” from E.

Updating in this form has the following difficulty: If E is the observed evidence then Updating implies that b1(E) = b0(E | E) = 1. Evidence, that is to say, becomes certain. De Finetti's account at least flirts with this difficulty: If doubt is removed, then E becomes certain. But evidence is often if not typically uncertain. Here is Jeffrey's example.

The agent inspects a piece of cloth by candlelight, and gets the impression that it is green, although he concedes that it might be blue or even (but very improbably) violet. If G, B and V are the propositions that the cloth is green, blue or violet respectively, then the outcome of the observation might be that, whereas originally his degrees of belief in G, B and V were .30, .30 and .40, his degrees of belief in those same propositions after his observation are .70, .25 and .05. (Jeffrey LOD, 165)

Here there seems to be no evidence E such that b1(G) = b0(G | E).

Jeffrey's resolution of this problem is to take account not only of the support provided by the uncertain evidence E for a hypothesis H, but also of the support for H provided by ¬E, the negation of E.

Jeffrey conditionalization:
b1(H) = b0(H | E)b1(E) + b0(H | ¬E)b1(¬E)

Jeffrey conditionalization is a consequence of the principle that conditional beliefs should not change from t0 to t1; in particular that

b1(H | E) = b0(H | E) and b1(H | ¬E) = b0(H | ¬E)

together with the truth

b1(H) = b1(H | E)b1(E) + b1(H | ¬E)b1(¬E)

See (Jeffrey LOD, chapter 11) and Section 6.2 of Bayesian epistemology for more complete accounts.
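The rule is a one-line computation. The prior joint beliefs and the post-observation probability of E below are invented numbers; the rule applied is exactly the displayed formula, and setting b1(E) = 1 recovers ordinary updating.

    # Invented prior beliefs about a hypothesis H and an evidence proposition E.
    b0_H_and_E, b0_H_and_notE = 0.30, 0.10
    b0_E = 0.40
    b1_E = 0.90            # belief in E after the uncertain observation

    b0_H_given_E    = b0_H_and_E / b0_E
    b0_H_given_notE = b0_H_and_notE / (1 - b0_E)

    # Jeffrey conditionalization: conditional beliefs stay fixed, the weights change.
    b1_H = b0_H_given_E * b1_E + b0_H_given_notE * (1 - b1_E)
    print(b1_H)            # 0.75 * 0.9 + (1/6) * 0.1, about 0.692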

5.4 Testing statistical hypotheses

A statistical hypothesis states the distribution of some random variable. (See the supplementary document Basic Probability for a brief description of random variables.) The support of statistical hypotheses is thus an important sort of inductive inference, a sort of inverse inference. In a wide class of cases the problem of induction amounts to the problem of formulating good conditions for accepting and rejecting statistical hypotheses. Two specific approaches to this question are briefly surveyed here; the method of likelihood ratios and that of Neyman-Pearson statistics. Likelihood can be given short shrift since it is treated in depth and detail in the article on inductive logic. General methodological questions about sampling and the separation of effects are ignored here. What follows are brief descriptions of the inferential structures.

Logical, frequentist, and subjectivistic views of induction presuppose specific accounts of probability. Accounts of hypothesis testing on the other hand do not typically include specific theories of probability. They presume objective probabilities but they depend only upon the commonly accepted laws of probability and upon classical principles relating probabilities and frequencies.

5.4.1 Likelihood ratios and the law of likelihood

If h is a hypothesis and e an evidence statement then the likelihood of h relative to e is just the probability of e conditional upon h:

L(h | e) = P(e | h)

Likelihoods are in some cases objective. If the hypothesis implies the evidence then it follows from the laws of probability that the likelihood L(h | e) is one. Even when not completely objective, likelihoods tend to be less relative than the corresponding confirmation values: If we draw a red ball from an urn of unknown constitution, we may have no very good idea of the extent to which this evidence confirms the hypothesis that 2/3 of the balls in the urn are red, but we don't doubt that the probability of drawing a red ball given the hypothesis is 2/3. (See inductive logic, section 3.1.)

Isolated likelihoods are not good indicators of inductive support; e may be highly probable given h without confirming h. (If h implies e, for example, then the likelihood of h relative to e is 1, but P(h | e) may be very small.) Likelihood is however valuable as a method of comparing hypotheses: The likelihood ratio of hypotheses g and h relative to the same evidence e is the quotient

L(g | e) / L(h | e)

Likelihood ratios may have any value from zero to infinity inclusive. The law of likelihood says roughly that if L(g | e) > L(h | e) then e supports g better than it does h. (See section 3.2 of the article on inductive logic for a more precise formulation.)
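A small worked case shows how the comparison goes. The two urn hypotheses and the evidence are invented; the likelihoods are ordinary binomial probabilities.

    from math import comb

    # g: two thirds of the balls are red; h: one third are red.
    # Evidence e: 8 reds in 12 independent draws with replacement.
    def likelihood(p_red, n=12, k=8):
        return comb(n, k) * p_red**k * (1 - p_red)**(n - k)

    ratio = likelihood(2/3) / likelihood(1/3)
    print(ratio)   # 16 (up to rounding): by the law of likelihood,
                   # e supports g much better than it supports h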

The very general intuition supporting the method of likelihood ratios is just inference to the best explanation; accept that hypothesis among alternatives that best accounts for the evidence. Likelihoods figure importantly in Bayesian inverse inference.

5.4.2 Significance tests

Likelihood ratios are a way of comparing competing statistical hypotheses. A second way to do this consists of precisely defined statistical tests. One simple sort of test is common in testing medications: A large sample of people with a disease is treated with a medication. There are then two contradictory hypotheses to be evaluated in the light of the results:

h0: The medication has no effect. (This is the null hypothesis.)
h1: The medication has some curative effect. (This is the alternative hypothesis.)

Suppose that the known probability of a spontaneous cure, in an untreated patient, is pc, that the sample of treated patients has n members, and that the number of cures in the sample is ke. Suppose further that sampling has been suitably randomized so that the sample of n members (before treatment) has the structure of n draws without replacement from a large population. If the diseased population is very large in comparison with the size n of the sample, then draws without replacement are approximated by draws with replacement and the sample can be treated as a collection of independent and equiprobable trials. In this case, if C is a group of n untreated patients, for each k between zero and n inclusive the probability of k cures in C is given by the binomial formula:

P(k cures in C) = b(n, k, pc)
= C(n, k) pck(1 − pc)(n−k)

If the null hypothesis, h0, is true we should expect the probability of k cures in the sample to be the same:

P(k cures in the sample | h0) = P(k cures in C)
= b(n, k, pc)
= C(n, k) pck(1 − pc)(n−k)

Let kc = pcn. This is the expected number of spontaneous cures in n untreated patients. If h0 is true and the medication has no effect, ke (the number of cures in the medicated sample) should be close to kc and the difference

ke − kc

(known as the observed distance) should be small. As k varies from zero to n the random variable

k − kc

takes on values from −kc to n − kc with probabilities

b(n, 0, pc), b(n, 1, pc), …, b(n, n, pc)

This binomial distribution has its mean at k = kc, and this is also the point at which b(n, k, pc) reaches its maximum. A histogram would look something like this.

[Figure: histogram of the distribution of k − kc]

Given pc and n, this distribution gives the probability of each of the possible values of the observed distance between its minimum, −kc, and its maximum, n − kc; the values of k − kc lie on the abscissa and their probabilities on the ordinate. The significance level of the test is the probability given h0 of a distance as large as the observed distance.

A high significance level means that the observed distance is relatively small: a distance at least as large would be quite probable if the difference were due to chance, i.e. if the probability of a cure given medication were the same as the probability of a spontaneous, unmedicated, cure. In specifying the test an upper limit for the significance level is set. If the significance level exceeds this limit, then the result of the test is confirmation of the null hypothesis. Thus if a low limit is set (limits on significance levels are typically 0.01 or 0.05, depending upon the cost of a mistake) it is easier to confirm the null hypothesis and harder to accept the alternative hypothesis. Ceteris paribus, the lower the limit the more severe the test: when the alternative is accepted under a low limit, it is more likely that P(cure | medication) is close to the observed rate pe = ke / n.
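A short computation shows how the significance level is obtained in practice. The spontaneous cure rate, the sample size, and the number of observed cures are invented; the tail probability is computed directly from the binomial formula, reading “a distance as large as the observed distance” in the direction of more cures.

    from math import comb

    pc, n, ke = 0.2, 100, 31        # invented: cure rate 0.2, 100 patients, 31 cures
    kc = pc * n                     # expected cures under h0: 20
    observed_distance = ke - kc     # 11

    def binom(k):
        """b(n, k, pc): probability of k cures among n untreated patients."""
        return comb(n, k) * pc**k * (1 - pc)**(n - k)

    significance = sum(binom(k) for k in range(n + 1) if k - kc >= observed_distance)
    print(significance)             # roughly 0.006: below a 0.05 or 0.01 limit,
                                    # so the null hypothesis is not confirmed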

This is not the place for an extended methodological discussion, but one simple principle, obvious upon brief reflection, should be mentioned. This is that the size n of the sample must be fixed in advance. Otherwise a persistent researcher could, with arbitrarily high probability, eventually obtain any desired observed difference ke − kc, and hence as small a significance level as desired; for, in the case of Bernoulli trials, the probability that the difference ke − kc reaches any given value at some n is arbitrarily close to one.

5.4.3 Power, size, and the Neyman-Pearson lemma

If h is any statistical hypothesis a test of h can go wrong in either of two ways: h may be rejected though true—this is known as a type I error; or it may be accepted though false—this is a type II error.

If f is a (one-dimensional) random variable that takes on values in some interval of the real line with definite probabilities and h is a statistical hypothesis that determines a probability distribution over the values of f, then a pure statistical test of h specifies an experiment that will yield a value for f and specifies also a region of values of fthe rejection region of the test. If the result of the experiment is in the rejection region, then the hypothesis is rejected. If the result is not in the rejection region, the hypothesis is not rejected. A mixed statistical test of a hypothesis h includes a pure test but in addition divides the results not in the rejection region into two sub-regions. If the result is in the first of these regions the hypothesis is not rejected. If the result is in the second sub-region a further random experiment, completely independent of the first experiment, but with known prior probability of success, is performed. This might be, for example, drawing a ball from an urn of known constitution. If the outcome of the random experiment is success, then the hypothesis is not rejected, otherwise it is rejected. Hypotheses that are not rejected may not be accepted, but may be tested further. This way of looking at testing is quite in the spirit of Popper. Recall his remark that

The best we can say of a hypothesis is that up to now it has been able to show its worth, and that it has been more successful than other hypotheses although, in principle, it can never be justified, verified, or even shown to be probable. This appraisal of the hypothesis relies solely upon deductive consequences (predictions) which may be drawn from the hypothesis … (Popper LSD, 315)

A hypothesis that undergoes successive and varied statistical tests shows its worth in this way. Popper would not call this process “induction”, but statistical tests are now commonly taken to be a sort of induction.

Given a statistical test of a hypothesis h two critical probabilities determine the merit of the test. The size of the test is the probability of a type I error; the probability that the hypothesis will be rejected though true; and the power of the test is the chance of rejecting h if it is false. A good test will have small size and large power.

size = Prob(reject h | h is true)
power = Prob(reject h | h is false)

The Fundamental Lemma of Neyman-Pearson asserts that for any statistical hypothesis and any given size, there is a unique test of maximum power (known as a best test of that size). The best test may be a mixed test, and this is sometimes said to be counterintuitive: A mixed test (tossing a coin, drawing a ball from an urn) may, as Mayo puts it, “even be irrelevant to the hypothesis of interest” (Mayo 1996, 390). Mixed tests bear an uncomfortable resemblance to consulting tea leaves. Indeed, recent exponents of the Neyman-Pearson approach favor versions of the theory that do not depend on mixed tests (Mayo 1996, 390 n.).

5.5 Formal learning theory

Formal learning theory formulates the problem of induction in general terms as the question of how an agent should use empirical data to confirm and reject hypotheses about the world. In specific instances the theory sets goals of inquiry and compares methods for pursuing those goals. (See the early parts of the entry on the topic for an introduction to the approach. The comparison of the methods embodied in the hypotheses ‘All emeralds are green’ and ‘All emeralds are grue’ is a striking example that reveals the basic workings of the theory.)

Formal learning theory, like many other inductive methods, seeks deductive proof of the reliability of chosen inductive methods. (This effort is discussed in section 8.3 below.)

See (Suppes 1998) for a brief, critical and laudatory appraisal of the theory.

6. Induction, Values, and Evaluation

6.1 Pragmatism: induction as practical reason

In 1953 Richard Rudner published “The Scientist qua Scientist Makes Value Judgments” in which he argued for the thesis expressed in its title. Rudner's argument was simple and can be sketched in the framework of the Neyman-Pearson model of hypothesis testing: “[S]ince no hypothesis is ever completely verified, in accepting a hypothesis the scientist must make the decision that the evidence is sufficiently strong or that the probability is sufficiently high to warrant the acceptance of the hypothesis” (Rudner 1953, 2). Sufficiency in such a decision will and should depend upon the importance of getting it right or wrong. Tests of hypotheses about drug toxicity may and should have smaller size and larger power than those about the quality of a “lot of machine stamped belt buckles”. The argument is not restricted to scientific inductions; it shows as well that our everyday inferences depend inevitably upon value judgments; how much evidence one collects depends upon the importance of the consequences of the decision.

Isaac Levi, in responding to Rudner's claim and to later formulations of it, distinguished cognitive values from other sorts of values: moral, aesthetic, and so on (Levi 1986, 43–46). Of course the scientist qua scientist, that is to say in his scientific activity, makes judgments and commitments of cognitive value, but he need not, and in many instances should not, allow other sorts of values (fame, riches) to weigh upon his scientific inductions.

What is in question is the separation of practical reason from theoretical reason. Rudner denies the distinction; Levi does too, but distinguishes practical reason with cognitive ends from other sorts. Recent pragmatic accounts of inductive reasoning are even more radical. Following (Ramsey 1926) and (Savage 1954) they subsume inductive reasoning under practical reason; reason that aims at and ends in action. These and their successors, such as (Jeffrey LOD), define partial belief on the basis of preferences; preferences among possible worlds for Ramsey, among acts for Savage, and among propositions for Jeffrey. (See section 3.5 of interpretations of probability). Preferences are in each case highly structured. In all cases beliefs as such are theoretical entities, implicitly defined by more elaborate versions of the pragmatic principle that agents (or reasonable agents) act (or should act) in ways they believe will satisfy their desires: If we observe the actions and know the desires (preferences) we can then interpolate the beliefs. In any given case the actions and desires will fit distinct, even radically distinct, beliefs, but knowing more desires and observing more actions should, by clever design, let us narrow the candidates.

In all these theories the problem of induction is a problem of decision, in which the question is which action to take, or which wager to accept. The pragmatic principle is given a precise formulation in the injunction to act so as to maximize expected utility, to perform that action, Ai among the possible alternatives, that maximizes

U(Ai) = ∑j P(Sj | Ai)U(Sj ∧ Ai)

where the Sj are the possible consequences of the acts Ai, and U gives the utility of its argument.
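The injunction is easy to state as a computation. The acts, consequences, probabilities, and utilities below are all invented; the rule applied is just the maximization of U(Ai) = ∑j P(Sj | Ai)U(Sj ∧ Ai).

    # Each act maps consequences to pairs (probability of the consequence given
    # the act, utility of the consequence-and-act).
    acts = {
        "carry umbrella": {"rain": (0.3, 5), "no rain": (0.7, 8)},
        "leave umbrella": {"rain": (0.3, 0), "no rain": (0.7, 10)},
    }

    def expected_utility(act):
        return sum(p * u for p, u in acts[act].values())

    best = max(acts, key=expected_utility)
    print({a: expected_utility(a) for a in acts}, "->", best)
    # carry: 0.3*5 + 0.7*8 = 7.1; leave: 0.3*0 + 0.7*10 = 7.0; carrying maximizes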

6.2 On the value of evidence

One significant advantage of this development is that the cost of gathering more information, of adding to the evidence for an inductive inference, can be factored into the decision. Put very roughly, the leading idea is to look at gathering evidence as an action on its own. Suppose that you are facing a decision among acts Ai, and that you are concerned only about the occurrence or non-occurrence of a consequence S. The principle of utility maximization directs you to choose that act Ai that maximizes

U(Ai) = ∑j P(Sj | Ai)U(Sj ∧ Ai)

where the Sj are the possible consequences of the acts Ai and U represents utility.

Suppose further that you have the possibility of investigating to see if evidence E, for or against S, obtains. Assume further that this investigation is cost-free. Then should you investigate and find E to be true, utility maximization would direct you to choose that act Ai that maximizes utility when your beliefs are conditioned on E:

UE(Ai) = P(S | E ∧ Ai)U(S ∧ E ∧ Ai) + P(¬S | E ∧ Ai)U(¬S ∧ E ∧ Ai)

And if you investigate and find E to be false, the same principle directs you to choose Ai to maximize utility when your beliefs are conditioned on ¬E:

U¬E(Ai) = P(S | ¬E ∧ Ai)U(S ∧ ¬E ∧ Ai) + P(¬S | ¬E ∧ Ai)U(¬S ∧ ¬E ∧ Ai)

Hence if your prior strength of belief in the evidence E is P(E), the expected utility of investigating first and then choosing the best act in the light of what is found is the weighted average

P(E) maxi UE(Ai) + P(¬E) maxi U¬E(Ai)

and if this weighted average exceeds the maximum of U(Ai) then you should investigate. About this several brief remarks:

6.3 Predictions

6.3.1 A thesis about induction and probability

This thesis has two interlocking parts: Part one is announced in the title of a paper that supports it: “Why Probability Does Not Capture the Logic of Scientific Justification” (Kelly and Glymour 2004). Part two concerns relations between computation and induction of a particular sort; the claim is that inductive inference of this sort is better understood as structured like computation than in terms of probabilities defined on Boolean (or sigma) algebras (Kelly and Schulte 1995). We look first at part two.

The sort of inference in question postulates a large finite or denumerable sequence of outcomes or trials coded as natural numbers. In the simplest case this will be a sequence of zeros and ones. (1 = green, 0 = not green, for example, where the sequence represents draws without replacement from a large collection of emeralds.) There may also be a special sign (!) to mark the end of the inquiry. The hypothesis tested by such a sequence may in the simplest case be obvious. (All emeralds are green, or some emeralds are not green, for example.) The parallels with computation are evident; a sequence of natural numbers might also be the output of a Turing machine.

Here is how Kelly and Schulte (Kelly and Schulte 1995) situate the discussion.

One intuitive distinction between algorithms and inductive methods is that the former confer certainty whereas the latter do not. This certainty derives from two factors (1) a logical guarantee that the algorithm will produce the right answer on each input in a specified class, and (2) the fact that the algorithm halts, thereby signalling to the user in an unambiguous way what its output is. Hume's problem might be expressed by saying that there is no procedure for inductive inference that has both properties. … The standard response to this difficulty is to exempt inductive inference from condition (1). (Kelly and Schulte 1995, 3)

Freed from the constraint of certainty — condition (1) — induction “looks very different from the theory of computability:” its main concern is to manage uncertainty, mostly by means of probability. This is in sharp contrast to the theory of computability, clearly and traditionally involved with the non-probabilistic methods of modern logic.

Kelly and Schulte propose a different resolution:

[R]elax condition (2) without relaxing condition (1), so that an inductive method is guaranteed to converge to the right answer, but need not inform the user when it has done so. (Kelly and Schulte 1995, 3)

(See the entry Formal Learning Theory where convergence is treated in some detail.)

When one compares induction and computation with respect to their truth conditions the contrast between them is sharp and deep; the conjectures of computation are necessarily true or necessarily false, while inductive conjectures are contingently true or contingently false. The Kelly–Schulte proposal, on the other hand, invites the comparison of induction and computation in the experience of the judging subject or phenomenologically. From this point of view they have quite similar structures; computational uncertainty doesn't feel that different from inductive uncertainty, and the methodological parallels between computation and induction are clear. This conceptual shift, subsuming inductive inference under the regime of calculation, invites and supports an analogous revision in vocabulary: An inductive inference is verifiable if the hypothesis, if true, will be revealed as such at some time. So ‘some emeralds are not green’ is verifiable: if it is true then at some stage the coding sequence will include a zero. An inference is refutable if the hypothesis, if false, will be revealed as such at some time. So ‘all emeralds are green’ is refutable. An inference is decidable if both verifiable and refutable. So ‘the first emerald to be examined will be green’ is decidable. These are just the fundamental categories of computability for arithmetical functions, now applied to inductive inferences as well.

A second dimension that orders both realms — inductions and computations — classifies methods (both inductive and computational) in terms of the ease and rapidity with which they converge to solutions. A system (the application of a method to a coding sequence and a hypothesis) converges with certainty if it announces by a sign when the hypothesis in question is confirmed. The output !, mentioned above, might function in this way. A sequence converges to an output n in the limit if after some trial k every later trial yields n. And a sequence converges gradually to n if after some trial k every output is within a fixed and small rational distance of n.
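A minimal sketch (illustrative only, not taken from the cited papers) of such a method for the hypothesis that all emeralds are green: the hypothesis is refuted with certainty as soon as a non-green emerald turns up, and otherwise the method converges to it in the limit without ever announcing that it has done so.

    # An illustrative limiting method for the hypothesis 'all emeralds are green',
    # with the data stream coded as 1 = green, 0 = not green.
    def limiting_method(data_so_far):
        """Conjecture 'all green' until a 0 appears, 'not all green' thereafter.
        The method converges to the right answer on every data stream, but it
        never signals that it has finished."""
        return "not all green" if 0 in data_so_far else "all green"

    stream = [1, 1, 1, 0, 1, 1]                 # the fourth emerald examined is not green
    for n in range(1, len(stream) + 1):
        print(n, limiting_method(stream[:n]))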

It is not the least advantage of this schema that it supports parallel orderings of inductive and computational methods along the two dimensions of verifiability and convergence. There are nine stages in these two dimensions, ranging from decidable methods that converge with certainty to refutable methods that converge gradually. Each stage applies equally to empirical and to purely computational problems. Further, these stages correspond to the Kleene arithmetical hierarchy of functions (in the computational case) and to the Borel hierarchy of sets (in the empirical case).

As concerns the first part of the thesis — the inadequacy of probabilistic accounts of scientific justification — there are first some oft cited difficulties. One of these is the apparent impossibility of conditionalizing on contingent propositions of probability zero. The denumerable additivity of probability in the continuous case raises this in a pointed and critical way.[2] A second well known difficulty with Bayesian accounts is the requirement of logical transparency or omniscience. These accounts either identify probability with strength of partial belief or at least require that partial belief conform to the laws of probability. If belief is defined on the sentences of a structured language, as in interpretations of probability and sections 5.1 - 5.4 above, then the probability of every necessarily true sentence must be one. The concept of necessity at work is typically left unspecified, but it is difficult to avoid the consequence that if A implies B then the probability of A cannot exceed that of B. That is to say that strength of belief is non-decreasing through logical entailment. And this, it is argued, is unrealistic.

The major argument of (Kelly and Glymour 2004) in support of the principle that probability cannot provide an adequate account of scientific justification extends the latter difficulty. Any good account of scientific reason, they say, must classify and account for the complexity and difficulty of inductive inference. The above ordering of methods by verifiability and convergence gives an at least preliminary classification. Bayesian methods on the other hand are incapable of accounting for complexity and the interplay of conjecture and refutation - logical omniscience runs roughshod over the critical distinctions. The probability P(h|e) of a given hypothesis h conditional on changing evidence e may fluctuate from close to zero to close to one as e accumulates. This violates the central principle of convergence: A conjecture if false will be rejected at some stage, and if true will never be rejected.

The Kelly–Schulte claim cited above: “[R]elax condition (2) without relaxing condition (1), so that an inductive method is guaranteed to converge to the right answer, but need not inform the user when it has done so,” asserts one of the principal desiderata of formal learning theory. Indeed (Kelly and Schulte 1995), together with (Kelly and Glymour 2004) can well be read as motivating prolegomena to formal learning theory.

6.3.2 Prediction Games

Yet another sort of meta-induction involves prediction games. In the simplest case, as in formal learning theory, the data consist in a sequence x = x(1), x(2), … of zeros and ones, understood as coding the outcomes or trials of a process. There are also given at each trial the predictions of each of a group of experts, e1, … , ek for the next trial. The meta-inductivist, M, after trial n knows the outcomes x(1), … x(n) and knows also the past predictions and the prediction for trial n + 1 of each expert. On the basis of this information M tries to predict x(n + 1), the outcome of the next trial. The problem of induction in this setting is to find a good general method for such predictions.

Initially the talk of experts is just a picturesque way of referring to data streams or sequences, though at a later stage the experts may be thought of as embodiments of theories that issue predictions given inputs. The approach is generalized (briefly discussed below) to treat sequences of values of a real-valued variable. The exposition here follows the approach of Gerhard Schurz in a recent article (Schurz 2008)[3] which makes use of results of (Cesa-Bianchi and Lugosi 2006). Schurz's account also lends itself to computer simulations of meta-inductive methods.

Learning from the experts

One method for amalgamating the views of the experts is to follow the majority of them. A famous classical theorem — the Condorcet Jury Theorem — supports this. The import of the theorem can be expressed as follows. (See Black 1963 for a clear and accessible proof.)

Suppose that a group of people each expresses a yes-no opinion about the same matter of fact, that they reach and express these opinions independently, and that each has better than 0.5 probability of being right. Then as the size of the group increases without bound the probability that a majority will be right approaches one.

(The condition can be weakened; probabilities need not uniformly exceed 0.5. The theorem also applies to quantitative estimates in which more than two values are in question.) To see why the theorem holds, consider a very simple special case in which everyone has exactly 2/3 probability of being right. Amalgamating the opinions then corresponds to drawing once from each urn in a collection in which each urn contains two red (true) balls and one black (false) ball. The weak law of large numbers entails that as the number of urns, and hence draws, increases without bound the probability that the relative frequency of reds (or true opinions) drawn differs from 2/3 by more than a fixed small quantity approaches zero. (See the supplementary document Basic probability.) This also underscores the importance of the diversity requirement; if the experts all reached the same conclusion on the basis of the same sources, however independently, the conclusion would be no better supported than that reached by any one of them. And, of course, the requirement that the probabilities, or a sufficient number of them, exceed 0.5 is critical: If these probabilities are all less than 0.5 the theorem implies that a majority will be wrong in the limit.
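The special case just described is easy to check by simulation (the group sizes, the probability 2/3, and the number of trials below are of course arbitrary): as the group grows, the estimated probability that the majority is right climbs toward one.

    import random

    def chance_majority_right(n_voters, p=2/3, trials=5000):
        """Estimate the probability that a majority of n_voters independent voters,
        each right with probability p, is right."""
        hits = 0
        for _ in range(trials):
            correct = sum(random.random() < p for _ in range(n_voters))
            hits += correct > n_voters / 2
        return hits / trials

    for n in (1, 11, 101, 501):
        print(n, chance_majority_right(n))      # approaches 1 as the group grows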

The Condorcet Jury Theorem, as interesting as it is, can apply in only very few and special cases of prediction. What one wants are methods for finding the best performing experts with no assumptions about their competence or intentions; some experts may be malign deceivers, some may be simply ignorant. Of course no meta-method can create knowledge where none exists; an optimal meta-method is one that finds the best performing expert.

A very simple method for exploiting the predictions of the experts is to ignore all subsequent predictions of any expert who predicts wrongly at any trial and to predict the outcome predicted by a majority of the so far infallible experts for the next trial. (If the infallible experts are evenly divided, just flip a coin.)

How good is this method? Suppose, to simplify even more, that we know that one of the experts — we don't know which one or ones — is infallible. This expert predicts correctly at every trial. Now a simple argument (Cesa-Bianchi and Lugosi 2006, 4) establishes an upper bound on the number of mistakes the majority method can make as a function of the number s of experts: The essential principle is just that each time the method makes a mistake, at least half of the so far infallible experts are discarded, for at least half of those experts (those whose prediction was mimicked by the meta-method) made that same mistake. Let the total number of experts be s and the number of infallible experts remaining after the nth mistake be e(n). So e(0) = s and, since some expert is always infallible, for each n

1 ≤ e(n) ≤ e(n − 1)/2 ≤ … ≤ e(0)/2^n = s/2^n

1 ≤ s/2^n

2^n ≤ s

n ≤ log2(s)

Thus the total number of mistakes can never exceed log2(s). If there is just one expert, assumed to be infallible, the method yields 0 = log2(1) mistakes. In general, if the total number s of experts is 2^k for some integral k, then the bound on the number of mistakes is just k. If, for example, every possible sequence of length r represents the predictions of some expert, the upper bound on the number of mistakes in r trials is just the length of these sequences, or the number of trials. This is hardly an impressive result, nor is it intended as such. The point is just to illustrate the general framework.
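The method and its bound can be put in a short sketch (the function name, the tie-breaking rule, and the example data are all invented for illustration): here every binary sequence of length three is the prediction record of some expert, so s = 8 and the meta-method makes at most log2(8) = 3 mistakes.

    from itertools import product

    def majority_of_infallibles(expert_predictions, outcomes):
        """Predict with the majority of the so-far infallible experts, discarding any
        expert who errs; assumes, as in the text, that at least one expert never errs.
        Returns the number of mistakes made by the meta-method."""
        alive = set(range(len(expert_predictions)))
        mistakes = 0
        for t, outcome in enumerate(outcomes):
            votes = sum(expert_predictions[e][t] for e in alive)
            prediction = 1 if 2 * votes >= len(alive) else 0    # ties broken in favour of 1
            mistakes += prediction != outcome
            alive = {e for e in alive if expert_predictions[e][t] == outcome}
        return mistakes

    outcomes = [1, 0, 1]
    experts = list(product([0, 1], repeat=len(outcomes)))   # s = 8; one expert is infallible
    print(majority_of_infallibles(experts, outcomes))       # never more than log2(8) = 3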

One obvious shortcoming of the simple majority method is that it takes insufficient account of differences in the accuracy of the methods it exploits; the expert who has made one error in the first 100 trials has no more weight on the 101st trial than does the expert who made 100 mistakes. There is also the possibility of a counter-inductive grue-like situation in which the majority of heretofore infallible experts is always wrong, in which case the method leads to the maximum number of mistakes.

Dealing with deceivers

An obvious fix for the obvious shortcoming is to make use of the success scores and rates of the experts. We define the success score of the expert e at trial n as the number of successful predictions by e up to and including n,

SS(e, n) = Number of correct predictions by e at trials ≤ n

And the success rate of e at n as the ratio

sr(e, n) = SS(e, n) / n

One might then follow the leader — follow the prediction for the next trial of the expert with the best success rate up to the present trial. What success rate is guaranteed by following the leader? Zero, for the leaders at a given trial may in every case predict wrongly at the succeeding trial. Following the leader can be more reliable if there is one leader to follow; one expert who maintains a best success rate. Let us say that an expert b is best if there is a trial tb after which b's success rate is never less than that of any other expert. (Never less; best is forever.) There need be no best expert; two or more experts might continually switch the lead among them so that for every n every expert makes a mistake at some trial after n. If however there is a best expert b then the method of following the leader will assure that after tb M's truncated success rate

(Number of successes after tb) / (n − tb)

will equal that of the best experts. But following the leader entails nothing about M's success rate before tb; this might be zero. It is zero, for example, if every expert who becomes a leader before tb predicts wrongly on the next trial. Hence if a best expert b makes p successful predictions between tb and n, following the leader will assure M a success rate overall of p/n (since M may have made no correct predictions up to tb). Hence as tb increases and n remains the same the lower bound on success rate guaranteed by following the leader diminishes. If tb = n − 1, the rate is 1/n.
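Following the leader is equally easy to sketch (the tie-breaking rule and the sample data below are invented): at each trial the meta-inductivist copies the prediction of an expert whose success score, and hence success rate, is maximal so far.

    def follow_the_leader(expert_predictions, outcomes):
        """At each trial predict what an expert with the highest success rate so far
        predicts (lowest index breaks ties; before any trial, follow expert 0).
        Returns the meta-inductivist's overall success rate sr(M, n)."""
        k = len(expert_predictions)
        score = [0] * k            # success scores SS(e, n); at any trial the rates share a denominator
        meta_score = 0
        for t, outcome in enumerate(outcomes):
            leader = max(range(k), key=lambda e: score[e])
            meta_score += expert_predictions[leader][t] == outcome
            for e in range(k):
                score[e] += expert_predictions[e][t] == outcome
        return meta_score / len(outcomes)

    experts = [[1, 1, 0, 1, 1], [0, 0, 1, 1, 1]]
    outcomes = [1, 1, 1, 1, 1]
    print(follow_the_leader(experts, outcomes))     # 0.8 for these invented data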

An expert who predicts wrongly when his success rate is highest is a systematic deceiver. Systematic deceivers can be detected by conditionalizing success rates: If an expert's success rate overall is significantly higher than his success rate on trials for which his rate is highest, then he deceives systematically. If systematic deceivers are ignored then the method of following the leader can assure a success rate that approximates the maximal success rate of non-deceivers. (Schurz 2008 Theorem 3). This does not show that following the leader yields an optimal success rate; some systematic deceiver may have a higher success rate than any non-deceiver. Following the leader thus does not assure an optimal rate of success, a rate at least as good as that of any expert.

In fact a simple example (Cesa-Bianchi and Lugosi 2006, 67) shows that there can be no optimal method for the two-valued case: Let there be two experts one of whom always predicts one and the other of whom always predicts zero, and let the data stream include at each trial just the opposite of M's prediction. Then at least one of the experts has a success rate of at least 0.5, and the success rate of M is constantly zero.

The continuous case; weighted average prediction

Schurz went on to show that in the continuous case, in which outcomes and predictions take real values in the closed [0, 1] interval, a much stronger result is available. This depends upon the notion of attraction of experts: An expert e is attractive (to the meta-inductivist M) at trial n if e's success rate at n exceeds that of M, and the (strength of) attraction of e at n is just the difference

At(e, n) = sr(e, n) − sr(M, n), if this difference is positive

At(e, n) = 0 otherwise

The relative attraction, ρ, of e (to M) at n is just the normalized ratio of e's attraction at n to the sum of the attractions at n of all experts:

ρ(e, n) = At(e, n) / ∑iAt(ei, n)

At each trial n, as e varies, ρ(e, n) is a probability on the collection of experts, corresponding for each e to the extent of e's expertise at n. Experts whose attraction is not positive, i.e., whose success rate at n does not exceed M's, have relative attraction zero at n.

Let P(e, n + 1) be the prediction at n of the expert e for trial n + 1. At each n, P(e, n + 1) is a finite random variable with distribution ρ(e, n). The weighted prediction of e at n for n + 1 weights this random variable at n by the probability ρ(e, n).

WP(e, n + 1) = ρ(e, n) P(e, n + 1) = [At(e, n) / ∑iAt(ei, n)]P(e, n + 1)

And the average weighted prediction of all positively attractive experts at n is the weighted mean of the distribution of P(e, n + 1)

π(n + 1) = ∑j WP(ej, n + 1) = [∑j At(ej, n)P(ej, n + 1)] / ∑i At(ei, n)

(For n = 0 set π(n + 1) = 0.5.)

The weighted average method predicts x(n + 1) = π(n + 1)

Of course the data stream describes a series of contingent events. The weighted prediction method cannot affect the data stream and thus cannot assure that M's estimates or predictions are not uniformly far from the values of the outcomes x(n). What can be shown, however (Schurz Theorem 4[4]) is that under plausible structural constraints as the number n of trials increases π(n + 1) becomes increasingly close to the predictions of the maximally correct expert or experts. The difference between M's success rate and the maximum success rate of all experts approaches zero as the number n of trials increases without bound.
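A one-step sketch of the attraction-weighted prediction (the helper name and the numbers are invented; in the continuous case success would be scored by closeness, e.g. 1 − |prediction − outcome|, an assumption made here rather than taken from the text):

    def weighted_average_prediction(expert_preds_next, expert_rates, meta_rate):
        """One step of attraction-weighted prediction: experts whose success rate
        exceeds the meta-inductivist's get weight proportional to the excess;
        if no expert is attractive, fall back to 0.5 (as for n = 0)."""
        attraction = [max(r - meta_rate, 0.0) for r in expert_rates]
        total = sum(attraction)
        if total == 0:
            return 0.5
        return sum(a * p for a, p in zip(attraction, expert_preds_next)) / total

    # Two experts predict 0.9 and 0.4 for the next trial; their success rates so far
    # are 0.8 and 0.6, against a meta-inductive success rate of 0.5.
    print(weighted_average_prediction([0.9, 0.4], [0.8, 0.6], 0.5))   # 0.775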

The binary case revisited

We saw above that there can be no optimal method, no method that assures a maximal success rate, in the two-valued case. The success rates of systematic deceivers may always be inaccessible. Of course the special instance of the continuous case, in which elements of the data stream and the predictions of experts are always either zero or one while π — the meta-inductive prediction — is in the closed [0, 1] interval, falls within the scope of the above result; the meta-inductivist will in the limit approach the maximal success rate (of zero – one expert predictions). There will necessarily be non-integral quantities interspersed in this sequence, and it rings a bit false to call these predictions; the meta-inductivist may know that the data are all integers, and he would in this case be announcing predictions that he knew a priori to be false. It serves clarity and plausibility to call the weighted averages what they are: estimates.

The resolution of the two-valued case can however be improved by applying the method of weighted-average prediction to binary data streams. The principle of the application is to use a cooperating team of meta-inductivists. One then applies weighted average prediction as above to find at each trial n the value π(n + 1). In an extremely simple illustration, which may nevertheless reveal the leading idea, we assume that this prediction (or estimate) is a rational in the [0, 1] interval. Suppose now that π(n + 1) = p/q and that there are just q meta-inductivists all told. Then the method directs that p meta-inductivists predict one and the remaining q − p of them predict zero. The (considerable) complications to accommodate irrational quantities and different numbers of meta-inductivists accomplished, it can be shown that the mean success rate of the meta-inductive team approaches the maximal expert success rate in the limit. (Schurz Theorem 5)
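The leading idea of the team can be sketched in a few lines (the rounding of the estimate to a rational with a small denominator is an illustrative shortcut, not the text's construction):

    from fractions import Fraction

    def team_predictions(pi_next, max_denominator=100):
        """Approximate the estimate pi_next by a rational p/q and have p team members
        predict 1 and the remaining q - p predict 0. The complications for irrational
        values and for a fixed team size are ignored in this sketch."""
        frac = Fraction(pi_next).limit_denominator(max_denominator)
        p, q = frac.numerator, frac.denominator
        return [1] * p + [0] * (q - p)

    print(team_predictions(0.75))    # three of four team members predict 1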

It should be emphasized that it is perfectly compatible with weighted average prediction that at every trial π(n + 1) is far from x(n + 1). What is assured is that no expert can be much better, or in the limit any better, than this estimate.

Is there good reason to believe weighted average predictions? Not without some reason to credit the uniformity of the experts: that past success rates are good predictors of future success rates. Schurz briefly discusses some ways of supplementing the simple method by assumptions of reliability and the use of nomological predicates. These efforts aside, it is an advantage of meta-induction in general, and of its weighted-average form in particular, that it is free of synthetic principles and can thus contrast different object-inductive methods without substantive presupposition.

7. Induction, deduction and rationality

7.1 The project of D.C. Williams and D.C. Stove

A quite different view of the problem of induction asks not whether induction can be shown to lead to truth, but whether it is a rational or reasonable process. This is the approach of D.C. Williams and David Stove, and also of David Armstrong. (Williams 1947, Stove 1986, Armstrong 1991) We look first at the Williams-Stove account.

Williams argued in (Williams 1947) that one form of inductive inference is a reasonable method and that this proposition is, in fact, a necessary truth; Stove repeated the argument with a few corrections and reformulations four decades later. By claiming that induction is ‘reasonable’ Williams intended more than that it is characterized by ordinary sagacity. Indeed, he says that an aptitude for induction is just what we mean by ‘ordinary sagacity’. His claim is that induction is reasonable in the stronger (and not quite standard) sense of being “logical or according to logic.” (Williams 1947, 23)

Williams and Stove intended their accounts to defend reason against Hume's argument, in (Hume THN I.III.VI) discussed in section 2 above, that the principle of the uniformity of nature can be supported neither by deductive (“demonstrative”) arguments nor by contingent probabilistic methods, and hence that inductive inference is the work not of reason but of the imagination. Hume, according to Williams, held that:

although our nervous tissue is so composed that when we have encountered a succession of Ms which are P we naturally expect the rest of the Ms to be P, and although this expectation has been borne out by the event in the past, the series of observations never provided a jot of logical reason for the expectation, and the fact that the inductive habit succeeded in the past is itself only a gigantic coincidence, giving no reason for supposing it will succeed in the future. (Williams 1947, 15)

Williams and Stove, for their part, maintain that, though there may be no demonstrative proof of the principle of the uniformity of nature, there are good demonstrative or deductive proofs that certain inductive methods yield their conclusions with high probability.

We first give an expository reconstruction of the Williams-Stove argument and then, guided by the analyses of Patrick Maher (Maher 1996) and Scott Campbell (Campbell 2001), remark on some of its complications and difficulties.

The specific form of inductive inference favored by Williams and Stove is what Carnap called inverse inference; inference to a character of a population on the basis of premises about a sample from that population.[5] Williams and Stove focus on inverse inferences about relative frequency, in particular on inferences of the form:

  (i) The relative frequency of the trait R in the sufficiently large sample S from the finite population X is r: f(R | S) = r

therefore

  (ii) The relative frequency of R in X is close to r: f(R | X) ≈ r

(Williams 1947, 12; Stove 1986, 71–75) (This includes of course the special case in which r = 1.)

Williams and Stove both set out to show that it is necessarily true that the inference from (i) to (ii) has high probability:

Given a fair sized sample, then, from any [finite] population, with no further material information, we know logically that it very probably is one of those which [approximately] match the population, and hence that very probably the population has a composition similar to that which we discern in the sample. This is the logical justification of induction. (Williams 1947, 97)

Both Williams and Stove (Williams 1947, 162; Stove 1986, 77, 131–144) recognize that induction may depend upon context and also upon the nature of the traits and properties to which it is applied. Neither pretends to resolve the inductive paradoxes, and Stove, at least, does not propose to justify all inductions: “That all inductive inferences are justified is false in any case” (Stove 1986, 77).

Williams' initial argument was simple and persuasive. It turns out, however, to have difficulties. In response to one of these difficulties Stove weakened the thesis considerably, but this response may not be sufficient. There is the further problem that the sense of necessity at issue is not made precise and becomes increasingly stressed as the sometimes contentious dialectic plays out.

There are two lemmata or principles on which the Williams-Stove argument depends:

Lemma 1. If X is a large finite population in which the relative frequency of a character R is r, it is necessarily true that the relative frequency of R in most large samples from that population will be close to r.
Lemma 2. (The proportional syllogism.) When probability is symmetrical,[6] the probability that an individual in a finite population has a trait R is equal to the relative frequency of that trait in the population.

For remarks on the proofs, see the Supplement on the Two Lemmata.
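Lemma 1 is a purely combinatorial fact about finite populations, and it can be checked numerically for particular (invented) figures. The sketch below uses the hypergeometric distribution to compute the proportion of k-samples whose relative frequency of R lies within a small margin of the population frequency:

    from math import comb

    def prob_sample_matches(N, K, k, eps):
        """Probability that a random k-sample (drawn without replacement) from a
        population of N individuals, K of whom have the trait R, has a relative
        frequency of R within eps of the population frequency K/N."""
        r = K / N
        return sum(comb(K, j) * comb(N - K, k - j)
                   for j in range(k + 1)
                   if abs(j / k - r) <= eps) / comb(N, k)

    # A population of 10,000 of which 60% have R, and samples of size 400:
    print(prob_sample_matches(10_000, 6_000, 400, 0.05))   # about 0.96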

Williams' simple argument begins with an induction in a ‘hyperpopulation’ (Williams 1947, 94–96) of all samples of given size k (‘k-samples’) drawn from a large finite population X of individuals. The ‘individuals’ of the hyperpopulation are k-samples of individuals from the population X.

Now let Prob be a symmetrical probability. For given population X, trait R and k-sample S0 from X in which the r.f. of R is r, the content of (i) above can be expressed in two premises:

Premise A. S0 is a k-sample from X.
Premise B. The r.f. of R in S0 is r, i.e., f(R | S0) = r.

Williams argued as follows. It follows from Lemma 1 that

(1) The r.f. of k-samples (in the hyperpopulation) that resemble X is high.

It follows from (1) and Lemma 2 that

(2) Prob(S0 resembles X) is high

It follows from Premise B that

(3) Prob[f(R | X) ≈ r | S0 resembles X] is high

It follows from (2) and (3) that

(4) Prob[f(R | X) ≈ r] is high

Hence, goes the argument, (i) above implies (ii).

We might like to reason in this way, and Williams did reason in this way, but as Stove pointed out in (Stove 1986, 65) the argument is not sound; it ignores the requirement of total evidence: Inductive inference in general and inductive conditional probabilities in particular are not monotonic; adding premises may change a good induction to a bad one and adding conditions may change, and sometimes reduce, the value of a conditional probability. Here (3) depends on Premise B but suppresses mention of it, thus failing to respect the requirement to take explicit account of all relevant and available evidence. Williams neglected the critical distinction between the probability of f(R | X) = r conditioned on resemblance:

Prob(f(R | X) = r | S0 resembles X)

and the probability when Premise B, the r.f. of R in S0, is added to the condition:

Prob[f(R | X) = r | S0 resembles X ∧ f(R | S0) = r]

When however the conditions of (3) are expanded to take account of Premise B,

(3*) It is necessarily true that Prob[f(R | X) = r | S0 resembles X ∧ f(R | S0) = r] ≈ 1

the result does not follow from the premises; (3*) is true for some values of r and not for others.

As Maher describes this effect (and as Williams himself had pointed out (Williams 1947, 89)): “Sample proportions near 0 or 1 increase the probability that the population is nearly homogeneous which, ceteris paribus, increases the probability that the sample matches the population; conversely, sample proportions around 1/2 will, ceteris paribus, decrease the probability of matching” (Maher 1996, 426). Thus the addition of Premise B to the condition of (3) might decrease the probability that S0 resembles the population X.

Stove's response to this difficulty was to point out, first, that neither he nor Williams ever claimed that every inductive inference, nor even that every instance of the (i) to (ii) inference, was necessarily highly probable. Stove, at least, agrees that there are instances of r, X, S and R for which it is not necessary that f(R | X) ≈ r follows from Premises A and B with high probability. According to Stove however, all that was needed to establish Williams' thesis was to give one case (of values for r, X, S and R) for which the inference holds necessarily: This would show that at least one inductive inference was necessarily rational. And this, he argued, will follow if we specify values of these parameters for which Premise B is not negatively relevant to (3). I.e., such that

Prob[f(R | X) = r | S0 resembles X ∧ f(R | S0) = r] ≥ Prob[f(R | X) = r | S0 resembles X]

is a necessary truth.

Stove provides a specific instance of Williams' argument in which:

X is the population of ravens and k = 3020.[7]

Premise A* is: S0 is a 3020-sample of ravens

Premise B* is: f(Black | S0) = 0.95

The critical part of the argument is the proof that

C*. Prob[f(Black | Ravens) ≈ 0.95 | S0 resembles Ravens ∧ f(Black | S0) = 0.95] ≥ Prob[f(Black | Ravens) ≈ 0.95 | S0 resembles Ravens]

which Stove accomplishes in detail and concludes that:

It follows necessarily from Premises A* and B* that Prob[f(Black | Ravens) ≈ 0.95] is high.

It must be remarked that this argument yields an attenuated form of Williams' original claim, quoted above, that sample to population inductive inferences are in general assured: “Given a fair sized sample, then, from any [finite] population, with no further material information, we know logically that it very probably is one of those which [approximately] match the population, and hence that very probably the population has a composition similar to that which we discern in the sample” (Williams 1947, 97). What we have in its place is that we know logically not that all such inferences are necessarily sound with high probability, but, at best, only that one carefully selected inference is, and then only when the probability in question is symmetrical.

Maher went on to argue that Stove's proof was in fact insufficient and that the crucial claim:

Prob[f(Black | Ravens) ≈ 0.95 | S0 resembles Ravens ∧ f(Black | S0) = 0.95] ≥ Prob[f(Black | Ravens) ≈ 0.95 | S0 resembles Ravens]

does not follow from premises A* and B*. He claimed that

[I]f some population proportions are more probable than others a priori, then sample proportions close to those more probable proportions will increase the probability of matching [i.e. of resemblance] while sample proportions far from the more probable proportions will decrease the probability of matching. (Maher 1996, 426)

The anticipated response to this from the defenders of Williams and Stove is that a priori probabilities are fixed, as Williams insisted, by the principle of indifference cited above (“cases are equiprobable unless they are known not to be equiprobable”) (Williams 1947, 72) and the classical definition of probability as the ratio of favorable to possible cases. But, as Maher pointed out, the principle of indifference presupposes that the exclusive and exhaustive possible cases are specified and fixed. A collection of propositions can be partitioned in alternative ways yielding different possible cases and in consequence different probabilities. To underscore this relativity Maher gave a plausible partition that yields prior probabilities in Stove's example, such that (assuming Premises A* and B*):

Prob(S0 resembles Ravens) ≈ 1

Prob[f(Black | Ravens) ≈ 0.95] ≈ 0

Prob[S0 resembles Ravens | f(Black| S0) = 0.95 ] ≈ 0

Hence

Prob[S0 resembles Ravens | f(Black| S0) = 0.95 ] < Prob[S0 resembles Ravens]

thus falsifying C* and the conclusion of Stove's argument.

Scott Campbell, in (Campbell 2001), responded to Maher's second argument with two principal criticisms: both claim that the low prior probability of the hypothesis ‘f(Black | X) ≈ 0.95’, consequent upon applying the principle of indifference to Maher's partition, forces an artificially low posterior probability even when the evidence supports a higher posterior.

Williams' original argument when expressed in general terms is simple and seductive: It is a combinatorial fact that the relative frequency of a trait in a large population is matched by its relative frequency in most large samples from that population. The proportional syllogism is a truth of probability theory: in the symmetrical case, relative frequency equals probability. From these it looks to follow that it is a necessary truth that it is highly probable that the frequency of a trait in a given sample from an inclusive population is close to its frequency in the population. We have seen that in these terms the consequence does not follow: “[S]ample proportions near 0 or 1 increase the probability that the population is nearly homogeneous which, ceteris paribus, increases the probability that the sample matches the population; conversely, sample proportions around 1/2 will, ceteris paribus, decrease the probability of matching,” as Maher expresses it. (Maher 1996, 426) Stove proposed a weakened thesis that for certain select samples and populations this effect is minimized, and that in these cases the conclusion does follow, thus partially justifying Williams' claim that certain inductive inferences necessarily yield their conclusions with high probability.

Maher argued that when prior probabilities are properly taken account of, C* in Stove's argument is seen to be false, and Campbell criticized this argument as depending upon faulty assignments of prior probabilities. Independently of the outcome of this particular disagreement, it is plausible that there are at least some examples of inductions, of instances of r, X, S and R, for which the Williams-Stove thesis is true; for which, that is to say, it is necessary that f(R | X) ≈ r follows from Premises A and B with high probability. But the Williams-Stove thesis emerges from this dialectic considerably diluted: It began life as the strong and simple modal assertion that it is a necessary truth that inductions of a quite common sort yield their conclusions with high probability. That thesis is seen to be false. What remains are, at best, certain specific instances of it.

7.2 David Armstrong on states of affairs, laws and induction.

D.M. Armstrong, like Williams and Stove, is a rationalist about induction. There is however a significant difference of emphasis and structure that marks Armstrong's approach off from that of Williams and Stove: The problem of induction was for the latter couple the topic and focus of their work on the question. Armstrong's major project on the other hand has for some three decades been the formulation and development of a theory of universals. (See the entry on properties where Armstrong's theory is discussed.) The problem of induction is treated in a brief paper (Armstrong 1991) and an eight-page section in (Armstrong 1983), which work is itself an application of the theory of universals. Armstrong's account of the problem of induction thus gains depth and richness, first in the light of his thesis that laws of nature are connections of universals, announced and defended in (Armstrong 1983) and secondly because it is a natural application of the elaborate theory of universals and states of affairs in which this thesis is developed. This theory yields a few essential metaphysical principles that underlie much of Armstrong's philosophy of science, including his views on induction, and that are usefully kept in mind:

Naturalism and physicalism:
Everything that exists is a physical entity in space / time.

Factualism:
Everything that exists is either (i) a state of affairs or (ii) a constituent of a state of affairs. These constituents include properties (including relations) and particulars.

Properties are of two sorts:
There are universals and ordinary, or second-class, properties. The difference between them is that second-class properties belong to particulars contingently, while this relation is always necessary in the case of universals.

About one-third of (Armstrong 1983) is devoted to stating and supporting three criticisms of what Armstrong calls the regularity theory of law. Put very generally, the various forms of the regularity theory all count laws, if they count them at all, as contingent generalizations or mere descriptions of the events to which they apply: “All there is in the world is a vast mosaic of local matters of fact, just one little thing and then another” as David Lewis put this view in (Lewis 1986, ix). One sort of regularity theory holds that laws of nature supervene on Lewis's vast mosaic. Armstrong argues against all forms of the regularity theory. Laws, on his view, are necessary connections of universals that neither depend nor supervene on the course of worldly events but determine, restrict, and govern those events. The law statement, a linguistic assertion, must in his view be distinguished from the law itself. The law itself is not linguistic; it is a state of affairs, “that state of affairs in the world which makes the law statement true” (Armstrong 1991, 505). A law of nature is represented as ‘N(F, G)’ where F and G are universals and N indicates necessitation: Necessitation is inexplicable, it is “a primitive, which we are forced to postulate” (Armstrong 1983, 92). That each F is a G, however, “does not entail that F-ness [the universal F] has N to G-ness” (Armstrong 1983, 85). That is to say that the extensional inclusion ‘all Fs are Gs’ may be an accidental generalization and does not imply a lawlike connection between Fs and Gs. In a “first formulation” of the theory of laws of nature (Armstrong 1983, 85), if N(F, G) is a law, “it entails the corresponding Humean or cosmic uniformity: (x)(Fx ⊃ Gx)”. In later reconsideration (Armstrong 1983, 149), however, this claim is withdrawn: N(F, G) does not entail that all Fs are Gs, for some Fs may be “interfered with,” preventing the law's power from doing its work.

Armstrong's rationalism does not lead him, as it did Williams and Stove, to see the resolution of the problem of induction as a matter of demonstrating that induction is necessarily a rational procedure: “[O]rdinary inductive inference, ordinary inference from the observed to the unobserved , is, although invalid, nevertheless a rational form of inference. I add that not merely is it the case that induction is rational, but it is a necessary truth that it is so” (Armstrong 1983, 52). Armstrong does not argue for this principle; it is a premise of an argument to the conclusion that regularity views imply the inevitability of inductive skepticism; the view, attributed to Hume, that inferences from the observed to the unobserved are not rational (Armstrong 1983, 52). Armstrong seems to understand ‘rational’ not in Williams' stronger sense of entailing deductive proofs, but in the more standard sense of (as the OED defines it) “Exercising (or able to exercise) one's reason in a proper manner; having sound judgement; sensible, sane.” (Williams' “ordinary sagacity,” near enough.)

The problem of induction for Armstrong is to explain why the rationality of induction is a necessary truth. (Armstrong 1983, 52) Or, in a later formulation, to lay out “a structure of reasoning which will more fully reconcile us (the philosophers) to the rationality of induction” (Armstrong 1991, 505). His resolution of this problem has two “pillars” or fundamental principles. One of these is that laws of nature are objective natural necessities and, in particular, that they are necessary connections of universals. The second principle is that induction is a species of inference to the best explanation (IBE). “[T]he core idea is very simple: observed regularities are best explained by hypotheses of strong laws of nature [i.e., objective natural necessities], hypotheses which in turn entail conclusions about the unobserved” (Armstrong 2001, 503). IBE, as its name suggests, is an informal and non-metric form of likelihood methods. Gilbert Harman coined the term in (Harman 1965). See also Harman (1968). Harman argued that enumerative induction was best viewed as a form of IBE: The explanandum is a collection of statements asserting that a number of Fs are Gs and the absence of contrary instances, and the explanans, the best explanation, is the universal generalization, all Fs are Gs. IBE is clearly more general than simple enumerative induction, can compare and evaluate competing inductions, and can fill in supportive hypotheses not themselves instances of enumerative induction. (Armstrong's affinity for IBE should not lead one to think that he shares other parts of Harman's views on induction.)

An instantiation of a law is of the form

N(F, G) (a's being F, a's being G)

where a is an individual. Such instantiations are states of affairs in their own right.

As concerns the problem of induction, the need to explain why inductive inferences are necessarily rational, one part of Armstrong's resolution of the problem can be seen as a response to the challenge put sharply by Goodman: Which universal generalizations are supported by their instances? Armstrong holds that necessary connections of universals, like N(F, G), are lawlike, supported by their instances, and, if true, laws of nature. It remains to show how and why we come to believe these laws. Armstrong's proposal is that having observed many Fs that are G, and no contrary instances, IBE should lead us to accept the law N(F, G). “[T]he argument goes from the observed constant conjunction of characteristics to the existence of a strong law, and thence to a testable prediction that the conjunction will extend to all cases” (Armstrong 1991, 507).

7.3 Probabilistic laws of nature

Armstrong's theory is more ambitious than the Williams-Stove account of induction in also including an effort to account for probabilistic laws of nature. These are of the form

(i) (Pr:P)(F, G)

and their instances are of the form

(ii) (Pr:P)(F, G) (a's being F, a's being G)

where (i) “gives the objective probability of an F being G, a probability holding in virtue of the universals F and G”. Probabilistic laws, says Armstrong, are probabilities of necessitation, not necessitations of probabilities. (Armstrong 1983, 128).

Several difficulties come up immediately: If probabilistic laws are to conform to the laws of probability, (ii) should imply

(Pr:1 − P)(F, G) (a's being F, a's not being G)

or

(Pr:1 − P)(F, not-G) (a's being F, a's being not-G)

But this would contradict Armstrong's prohibition of negative states of affairs. (“Absences and lacks are ontologically suspect” (Armstrong 1983, 129).) A further problem is that non-probabilistic laws, of the form N(F, G), are necessitation relations (themselves states of affairs) holding between states of affairs, a's being F and a's being G. Since (i) is probabilistic, it looks as though it may hold for a given a that is F even when a is not G. The relation (ii) would then be incomplete, lacking its second term. Armstrong's response to this problem is the Principle of Instantiation: the requirement that laws (Pr:P)(F, G) be instantiated only by individuals that are both F and G.

But as Bas van Fraassen pointed out with detailed examples in “Armstrong on Laws and Probabilities” (1987), since this requirement prohibits additivity, it blocks application of the laws of probability. This led Armstrong to require that probabilistic laws be instantiated only when their range is infinite. In the finite case they are to be considered counterfactuals.

The Principle of Instantiation has also the consequence that the ties of probability to frequency are broken. The laws of large numbers and the classical limit theorems are hence apparently inapplicable to Armstrong's probabilistic laws, and one is left to wonder why these laws should be called probabilistic. (See Slowik 2005 for a defense of Armstrong's theory against van Fraassen's criticisms.)

8. Justification and Support of Induction

Hume's argument, famous in its generalized version, was for him a lemma in his positive account of induction. That account makes of induction a habit of the imagining mind: Previous impressions of a cause followed by impressions of its effect form a habit which calls up the idea of the effect upon a new impression of the cause. Hume even gives the details of probabilistic reasoning founded on this same simple model.

We remarked that Hume himself qualifies the bare statement of this theory. Wise men, he says, review their inferences and reflect upon their reliability. This review may lead one to correct reasoning in view of past errors: Noting that I've persistently misestimated the chances of rain, I may revise my forecast for tomorrow. The process is properly speaking not circular but regressive or hierarchical; a meteorological induction is reviewed by an induction not about meteorology but about inductions. Notice also that the revision of a forecast of rain may strengthen or reduce belief in rain, but may also, to put it in modern terms, increase dispersion: What was a pointed forecast of 2/3 becomes a less precise belief interval, from about (say) 1/2 to 3/4. This uncertainty will propagate up the hierarchy of inductions: Reflection leads me to be less certain about my reasoning about weather forecasts. Continuing the process must, in Hume's elegant phrase, “weaken still further our first evidence, and must itself be weaken'd by a fourth doubt of the same kind, and so on in infinitum.” How is it then that our cognitive faculties are not totally paralyzed? How do we “retain a degree of belief, which is sufficient for our purpose, either in philosophy or in common life” (Hume THN, 182, 185)? How do we ever arrive at beliefs about the weather, not to speak of the laws of physics?

8.1 General rules and higher-order inductions

Hume's resolution of this puzzle is in terms of general rules, rules for judging (Hume THN, 150). These are of two sorts. Rules of the first sort lead to singular predictive inferences when triggered by the experience of successive instances. These when unchecked may tempt us to wider and more varied predictions than the evidence supports (to grue-type inferences, for example). Rules of the second sort are corrective, these lead us to correct and limit the application of rules of the first sort on the basis of evidence of their unreliability. It is only by following general rules, says Hume, that we can correct their errors. (See Bates 2005 for a discussion of this process.)

Recall that Reichenbach gave an account of higher order or, as he called them, concatenated, probabilities in terms of arrays or matrices. The second-order probability

P{[P(C | B) = p] | A} = q

is defined as the limit of a sequence of first order probabilities. This gives a way in a Reichenbachean framework of inductively evaluating inductions in a given class or sort. Reichenbach refers to this as the self-corrective method, and he cites Peirce, “who mentioned ‘the constant tendency of induction to correct itself,’” as a predecessor (Reichenbach TOP, 446n, Peirce 1935, Volume II, 456). Peirce consistently thinks this way: “Given a certain state of things, required to know what proportion of all synthetic inferences relating to it will be true within a given degree of approximation” (Peirce 1935, 184). Ramsey cites Mill approvingly for “his way of treating the subject as a body of inductions about inductions” (Ramsey 1931, 198). See, e.g. (Mill 2002, 209). “This is a kind of pragmatism:” Ramsey writes, “we judge mental habits by whether they work, i.e., whether the opinions they lead to are for the most part true” (Ramsey 1931, 197–198). Hume went so far as to give a set of eight “Rules by which to judge of causes and effects” (Hume THN, I.III.15), obvious predecessors of Mill's canons.

8.2 Assessing the reliability of inductive inferences: calibration

These considerations suggest deemphasizing the question of justification—show that inductive arguments lead from truths to truths—in favor of exploring methods to assess the reliability of specific inferences. How is this to be done? If after observing repeated trials of a phenomenon we predict success of the next trial with a probability of 2/3, how is this prediction to be counted as right or wrong? The trial will either be a success or not; it can't be two-thirds successful. The approach favored by the thinkers mentioned above is to evaluate not individual inferences or beliefs, but habits of forming such beliefs or making such inferences.

One method for checking on probabilistic inferences can be illustrated in probabilistic weather predictions. Consider a weather forecaster who issues daily probabilistic forecasts for the following day. For simplicity of illustration suppose that only predictions of rain are in question, and that there are just a few distinct probabilities (e.g., 0, 1/10, …, 9/10, 1). We say that the forecaster is perfectly calibrated if for each probability p, the relative frequency of rainy days following a forecast of rain with probability p is just p, and that calibration is better as these relative frequencies approach the corresponding probabilities. Without going into the details of the calculation, the rationale for calibration is clear: For each probability p we treat the days following a forecast of probability p as so many Bernoulli trials with probability p of success. The difference between the binomial quotient and p then measures the goodness of calibration; the smaller the difference the better the calibration.

This account of calibration has an obvious flaw: A forecaster who knows that the relative frequency of rainy days overall is p can issue a forecast of rain with probability p every day. He will then be perfectly calibrated with very little effort, though his forecasts are not very informative. The standard way to improve this method of calibration was designed by Glenn Brier in (Brier 1950). In addition to calibrating probabilities with relative frequencies it weights favorably forecast probabilities that are closer to zero and one. The method can be illustrated in the case of forecasts with two possible outcomes, rain or not. If there are n forecasts, let pi be the forecast probability of rain on trial i, qi = (1 − pi), 1 ≤ in, and let Ei be a random variable which is one if outcome i is rain and zero otherwise. Then the Brier Score for the n forecasts is

B = (1/n) ∑i (pi − Ei)^2

Low Brier scores indicate good forecasting: The minimum is reached when the forecasts are all either zero or one and all correct, then B = 0. The maximum is when the forecasts are all either zero or 1 and all in error, then B = 1. More recently the method has been ramified and applied to subjective probabilities in general. See (van Fraassen 1983).
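Both checks are easy to carry out; the sketch below uses invented forecasts, and computes the Brier score in the binary squared-difference form that matches the bounds just stated:

    def brier_score(forecasts, outcomes):
        """Binary Brier score: mean squared difference between the forecast probability
        of rain and the rain indicator (1 = rain, 0 = no rain); 0 is best, 1 worst."""
        return sum((p - e) ** 2 for p, e in zip(forecasts, outcomes)) / len(forecasts)

    def calibration_table(forecasts, outcomes):
        """For each forecast probability used, the relative frequency of rain on the
        days on which it was issued (perfect calibration: frequency equals probability)."""
        counts = {}
        for p, e in zip(forecasts, outcomes):
            days, rainy = counts.get(p, (0, 0))
            counts[p] = (days + 1, rainy + e)
        return {p: rainy / days for p, (days, rainy) in counts.items()}

    forecasts = [0.9, 0.9, 0.1, 0.5, 0.5, 0.1]      # invented forecast probabilities of rain
    outcomes  = [1,   1,   0,   1,   0,   0]        # 1 = it rained, 0 = it did not
    print(brier_score(forecasts, outcomes))         # 0.09
    print(calibration_table(forecasts, outcomes))   # {0.9: 1.0, 0.1: 0.0, 0.5: 0.5}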

8.3 Induction and deduction

If the inductive support of induction need not be simply circular, the deductive support of induction is also seen upon closer examination not to be as easily dismissed as the Humean dilemma might make it seem. The laws of large numbers are the foundations of inductive inference relating frequencies and probabilities. These laws are mathematical consequences of the laws of probability and hence necessary truths. Of course the application of these laws in any given empirical situation will require contingent assumptions, but the inductive part of the reasoning certainly depends upon the deductively established laws.

8.3.1 The Humean dilemma revisited

As concerns deductive justifications, proofs of the reliability of inductions, there are a number of these: The Williams-Stove approach should, its flaws repaired, prove deductively that certain sample-to-population inferences are reliable with high probability; the Neyman-Pearson lemma establishes this for the inductive comparison of conflicting statistical hypotheses; and formal learning theory provides the means to prove deductively the superiority of specific research methods. De Finetti's representation theorem is a deductively established truth that justifies probabilistic predictions on the basis of frequency evidence, Reichenbach's theory provides deductive assurance of the reliability of certain inductive rules, and we have seen a simple case of Carnap's proof of the proportional syllogism.

These results raise an obvious question: Where does Hume's simple dilemma argument go wrong?

The answer has two cooperating parts: There is first the insufficiency of the logic that Hume had at hand. This was based in an algebra of ideas, structured by relations of overlap, exclusion and inclusion. Logical entailment was just the inclusion of the conclusion idea in that of the premise, in which case the premise was unthinkable in the absence of the conclusion. “By knowledge,” writes Hume, “I mean the assurance arising from the comparison of ideas” (Hume THN, 124). This foundation supports only a very weak logic. Secondly, probabilistic inference was for Hume a function of the imagination and, as such, was always “attended by uncertainty” (Hume THN, 124). That probability is determined by a simple set of laws and may easily be considered an extension of deductive logic, either of the object language, as by Reichenbach, or of the metalanguage, as by Carnap, is well beyond the reach of this scheme. Thus the sort of reasoning practiced by modern probabilists awaited the birth of modern logic and the axiomatization of probability.

8.3.2 The metaphysical and epistemological problems revisited.

If the problem of induction is to distinguish good from bad inductions, then its metaphysical form—say in what the difference consists—seems insoluble. Nor does there seem to be any general solution to the epistemological form of the problem—find a method for distinguishing good or reliable inductive habits from bad or unreliable habits. But modest efforts to solve special cases of the epistemological problem, some of which are discussed above, have in many cases enjoyed dramatic success.

8.4 Why trust induction? The question revisited

We can now return to the general question posed in section 1: Why trust induction more than other methods? Why not consult sacred writings, or “the wisdom of crowds” to explain and predict the movements of the planets, the weather, automotive breakdowns or the evolution of species?

8.4.1 The wisdom of crowds

The wisdom of crowds can appear to be an alternative to induction. James Surowiecki argued, in the book of this title (Surowiecki 2004), with many interesting examples, that groups often make better decisions than even informed individuals. It is important to emphasize that the model requires independence of the individual decisions and also a sort of diversity to assure that different sources of information are at work, so it is to be sharply distinguished from judging the mass opinion of a group that shares information and reaches a consensus in discussion. The obvious method suggested by Surowiecki's thesis is to consult polls or prediction markets rather than to experiment or sample on one's own. (See, for example, the link to prediction markets in the Other Internet Resources section of this entry.)

A precise justification for trusting the wisdom of crowds is provided by the Condorcet Jury Theorem. (See section 6.3.2 above.) As the theorem makes evident, the wisdom of crowds is not to be contrasted with inductive reasoning; indeed it depends upon the inductive principle expressed in the Condorcet theorem to amalgamate correctly the individual testimonies as well as upon the diversity of individual reasonings. What is valuable in the method is the diversity of ways of forming beliefs. This amounts to a form of the requirement of total evidence, briefly discussed in section 3.3 above.

The wisdom of crowds can be seen as a primitive prolegomenon to social epistemology. Social epistemology studies methods - stronger and more sophisticated than those supported by the Condorcet theorem - by which groups may conduct inquiry, including scientific inquiry. It does not pretend to replace induction, but to extend and enrich it. (See Goldman 1999 and the entry on social epistemology.)

As with Reichenbach's account of single-case probabilities, the wisdom of crowds depends essentially upon testimony.

8.4.2 Creationism and Intelligent Design

The wisdom of crowds thus depends upon good inductive reasoning. The use of sacred writings or other authorities to support judgments about worldly matters is, however, another matter. Christian creationism, a collection of views according to which the biblical myth of creation, primarily as found in the early chapters of the book of Genesis, explains, either in literal detail or in metaphorical language, the origins of life and the universe, is perhaps the most popular alternative to accepted physical theory and the Darwinian account of life forms in terms of natural selection. (See Ruse 2005 and the entry on creationism.) Christian creationism, nurtured and propagated for the most part in the United States, contradicts inductively supported scientific theories, and depends not at all upon any recognizable inductive argument. Many of us find it difficult to take the view seriously, but, according to a 2005 poll by CBS news, 51% of Americans hold that God created humans in their present form, 30% hold that humans evolved but God guided the process, and 15% believe that humans evolved and God did not guide the process. (Alfano 2005)

The apparent absurdity of Creationism has led some opponents of evolutionism and the doctrine of natural selection to eschew biblical forms of the view and to formulate a weaker thesis, known as the theory of intelligent design (Behe 1996, Dembski 1998). Intelligent design cites largely unquestioned evidence of two sorts: the delicate balance of the physical constants (that even a minute change in any of many of them would tip the physical universe into disequilibrium and chaotic collapse) and the complexity of life (that life forms on earth are very complex). The primary thesis of intelligent design is that the hypothesis of a designing intelligence explains these phenomena better than do current physical theories and Darwinian natural selection.

Intelligent design is thus not opposed to induction. Indeed its central argument is frankly inductive, a claim about likelihoods:

Prob(balance and complexity | intelligent design) > Prob(balance and complexity | current physics and biology)

Creationism and its offspring Intelligent Design have an Islamic form as well, expressed in (Yahya 2007), which is available online in eleven languages. The doctrines are gaining wide currency among Muslims in Europe who demand that they be taught in the schools.

There are a number of difficulties with the theory of intelligent design; these are explained in detail by Elliott Sober in (Sober 2002). (That article also includes an excellent primer on the sorts of probabilistic inference involved in the likelihood claim. See also the entry on creationism.) Briefly, there are problems of two sorts, both clearly stated in Sober's article: First, intelligent design theorists “don't take even the first steps towards formulating an alternative theory of their own that confers probabilities on what we observe”, as the likelihood principle would require (75). Second, the intelligent design argument depends upon a probabilistic fallacy. The biological argument, to restrict consideration to that, infers from

1. Prob(organisms are very complex | evolutionary theory) = low.
2. Organisms are very complex.

to

3. Prob(evolutionary theory) = low.

To see the fallacy, compare this with

1. Prob(double zero | the roulette wheel is fair) = low.
2. Double zero occurred.
Therefore,
3. Prob(the wheel is fair) = low.
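A minimal worked version of the roulette case makes the fallacy explicit (the prior probabilities and the biased-wheel likelihood below are invented purely for illustration). By Bayes' theorem, a low likelihood Prob(double zero | the wheel is fair) does not by itself make the hypothesis improbable; the posterior depends also on the prior probabilities. Suppose Prob(fair) = 0.99, Prob(biased) = 0.01, that double zero has probability 1/38 on a fair American wheel, and that a biased wheel shows double zero half the time. Then

\[
\Pr(\text{fair} \mid 00) \;=\; \frac{\Pr(00 \mid \text{fair})\,\Pr(\text{fair})}{\Pr(00 \mid \text{fair})\,\Pr(\text{fair}) + \Pr(00 \mid \text{biased})\,\Pr(\text{biased})} \;=\; \frac{(1/38)(0.99)}{(1/38)(0.99) + (0.5)(0.01)} \;\approx\; 0.84,
\]

so the wheel remains probably fair even though double zero was improbable on that hypothesis. The step from premises 1 and 2 to conclusion 3 in the biological argument fails for the same reason.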

What is to be emphasized here, however, is not the fallaciousness of the arguments adduced in favor of intelligent design. It is that intelligent design, far from presenting an alternative to induction, presumes certain important inductive principles.

8.4.3 Induction and testimony

Belief based on testimony, from the viewpoint of the present article, is not a form of induction. A testimonial inference typically has the form:

An agent A asserts that X.
A is reliable.
Therefore, X.

Or, in a more general probabilistic form:

1. An agent A asserts that X.
2. For any proposition X, Pr(X | A asserts that X) = p.
Therefore,
3. Pr(X) = p.
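A toy instance may help (the agent and the reliability figure are invented for illustration): suppose that checking A's past assertions supports Pr(X | A asserts that X) = 0.9. Then:

1. A asserts that the bridge is closed.
2. For any proposition X, Pr(X | A asserts that X) = 0.9.
Therefore,
3. Pr(the bridge is closed) = 0.9.

The conclusion concerns the bridge itself, no longer A's assertion about it.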

In an alternative form the asserted content is quoted directly.

What is characteristic and critical in inference based on testimony is the inference from a premise in which the conclusion is expressed indirectly, in the context of the agent's assertion (A asserts that X), to a conclusion in which that content occurs directly, not mediated by language or mind (X). It is also important that testimony is always the testimony of some agent or agents. And testimonial inference is not causal; testimony is neither cause nor effect of what is testified to. This is not to say that testimonial inference is less reliable than induction; only that it is different. (See Goldman 1999, chapter 4 for a thorough treatment of the reliability of testimony.)

Although testimonial inference may not be inductive, induction would be all but paralyzed were it not nourished by the testimony of authorities, witnesses, and sources. We hold that causal links between tobacco and cancer are well established by good inductive inferences, but the manifold data come to us through the testimony of epidemiological reports and, of course, texts that report the establishment of biological laws. Kepler's use of Tycho's planetary observations is a famous instance of induction based on testimony. Both Reichenbach's frequentist account of single-case probabilities and the wisdom of crowds require testimonial inference as input for their amalgamating inductions. And actuaries, those virtuosi of inductivism, depend entirely upon reports of data on which to base their conclusions. Of course inductive inferences from testified or reported data are no more reliable than the data themselves.

8.5 Learning to love induction

There are really two questions here: Why trust specific inductive inferences? and Why trust induction as a general method? The response to the first question is: Trust specific inductions only to the extent that they are inductively supported or calibrated by higher-order inductions. It is a great virtue of Ramsey's counsel to treat “the subject as a body of inductions about inductions” that it opens the way to this. As concerns trust in induction as a general method of forming and connecting beliefs, induction is not all that easy to avoid; the wisdom of crowds and Intelligent Design seem superficially to be alternatives to induction, but both turn out upon closer examination to be inductive. Induction is, after all, founded on the expectation that characteristics of our experience will persist in experience to come, and that is a basic trait of human nature. “Nature”, writes Hume, “by an absolute and uncontroulable necessity has determin'd us to judge as well as to breathe and feel” (Hume THN, 183). “We are all convinced by inductive arguments”, says Ramsey, “and our conviction is reasonable because the world is so constituted that inductive arguments lead on the whole to true opinions. We are not, therefore, able to help trusting induction, nor, if we could help it do we see any reason why we should” (Ramsey 1931, 197). We can, however, trust selectively and reflectively; we can winnow out the ephemera of experience to find what is fundamental and enduring.

The great advantage of induction is not that it can be justified or validated, as can deduction, but that it can, with care and some luck, correct itself, as other methods do not.

8.6 Naturalized and evolutionary epistemology

“Our reason”, writes Hume, “must be consider'd as a kind of cause, of which truth is the natural effect; but such-a-one as by the irruption of other causes, and by the inconstancy of our mental powers, may frequently be prevented” (Hume THN, 180).

Perhaps the most robust contemporary approaches to the question of inductive soundness are naturalized epistemology and its variant, evolutionary epistemology. These look at inductive reasoning as a natural process, the product, from the point of view of the latter, of evolutionary forces. An important division within naturalized epistemology exists between those who hold that there is little or no role in the study of induction for normative principles (that a distinction between correct and incorrect inductive methods has no more relevance than an analogous distinction between correct and incorrect species of mushroom), and those for whom epistemology should not only describe and categorize inductive methods but must also evaluate them with respect to their success or correctness.

The encyclopedia entries on these topics provide a comprehensive introduction to them.

Bibliography

Other Internet Resources

Related Entries

actualism | Bayes' Theorem | Carnap, Rudolf | conditionals | confirmation | epistemology: Bayesian | epistemology: evolutionary | epistemology: naturalized | epistemology: social | fictionalism: modal | Frege, Gottlob: logic, theorem, and foundations for arithmetic | Goodman, Nelson | Hempel, Carl | Hume, David | induction: new problem of | logic: inductive | logic: non-monotonic | memory | Mill, John Stuart | perception: epistemological problems of | Popper, Karl | probability, interpretations of | Ramsey, Frank | Reichenbach, Hans | testimony: epistemological problems of | Vienna Circle

Acknowledgments

Thanks to Kevin Kelly, Gerhard Schurz and Patrick Maher for helpful comments on sections 6.3.1, 6.3.2 and 7.1 respectively.