Stanford Encyclopedia of Philosophy

Supplement to The Problem of Induction

Basic Probability

For comprehensive accounts of probability, consult the articles: interpretations of probability, inductive logic, and Bayes' theorem. In the interest of avoiding redundancy only a simple sketch is given here.

Probability is easily defined on a first order language L described in section 5.1.1 above. Any function P that assigns numerical values to the sentences of L in accordance with the following laws is a probability:

For all sentences X and Y:

0 ≤ P(X) ≤ 1
P(¬X) = 1 − P(X)
If X and Y are logically inconsistent then P(X ∨ Y) = P(X) + P(Y)
If X is valid (i.e. true in all models) then P(X) = 1

The laws have as immediate consequences:

If X is inconsistent (i.e. true in no models) then P(X) = 0
P(X ∨ Y) = P(X) + P(Y) − P(X ∧ Y)

The conditional probability of Y given X, written P(Y | X), is defined as the ratio

P(X ∧ Y) / P(X)

when P(X) > 0.

This definition supports Bayes' Theorem:

If X1, …, Xk are logically exclusive and exhaustive then
    P(X1 | Y) = P(Y | X1)P(X1) / ∑i P(Y | Xi)P(Xi)

An alternative, and non-equivalent, development takes conditional probability as primitive and allows it to be defined when the probability of the condition is zero. See the references in interpretations of probability, section 1.
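
As an illustration of the theorem, here is a minimal computational sketch in Python. The three urns, their equal priors, and the likelihoods of drawing a red ball are invented purely for the example; only the formula itself comes from the statement above.

    # Bayes' theorem for exclusive and exhaustive hypotheses X1, ..., Xk:
    # P(Xj | Y) = P(Y | Xj) P(Xj) / sum over i of P(Y | Xi) P(Xi)
    def bayes(priors, likelihoods):
        """priors[i] = P(Xi); likelihoods[i] = P(Y | Xi); returns the posteriors P(Xi | Y)."""
        total = sum(p * l for p, l in zip(priors, likelihoods))  # P(Y), by total probability
        return [p * l / total for p, l in zip(priors, likelihoods)]

    # Hypothetical example: three equally probable urns with different shares of red balls;
    # Y is the event that a randomly drawn ball is red.
    priors = [1/3, 1/3, 1/3]
    likelihoods = [1/2, 1/4, 3/4]
    print(bayes(priors, likelihoods))   # [0.333..., 0.166..., 0.5]; the posteriors sum to one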

Conditional probability supports the definition of the critical concept of probabilistic independence. Intuitively, propositions are independent if the truth of either leaves the probability of the other unchanged. Explicitly:

X and Y are independent =df P(Y | X) = P(Y)

It follows immediately that:

X and Y are independent iff P(X ∧ Y) = P(X)P(Y)

Independence is symmetrical; if X and Y are independent then Y and X are independent.

If X and Y are independent then X and ¬Y, ¬X and Y, ¬X and ¬Y are all independent.

Thorough independence is a stronger notion than simple or pairwise independence:

The propositions X1, …, Xn are thoroughly independent =df
    If Y1, …, Ym are any among the Xi then P(Y1 ∧ … ∧ Ym) = P(Y1)P(Y2) … P(Ym)

If X1, …, Xn are thoroughly independent, P(X1 ∧ … ∧ Xn) is positive and Y1, …, Ym, Ym+1 are any distinct propositions among the Xi then

P(Ym+1 | Y1 ∧ … ∧ Ym) = P(Ym+1)

If X1, …, Xn are thoroughly independent and X1′, …, Xn′ result from negating some or all of them then X1′, …, Xn′ are thoroughly independent.

Notice that the propositions X1, …, Xn may be pairwise independent (i.e. every two of them may be independent) without being thoroughly independent.
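
The standard illustration of this gap (not drawn from the text above) uses two fair coin tosses: each pair of the three propositions below is independent, yet the three together are not. A short Python check, with the outcome space and the propositions chosen only for this purpose:

    from itertools import product

    outcomes = list(product('HT', repeat=2))     # the four equiprobable results of two tosses
    prob = {o: 1/4 for o in outcomes}

    X1 = {o for o in outcomes if o[0] == 'H'}    # first toss heads
    X2 = {o for o in outcomes if o[1] == 'H'}    # second toss heads
    X3 = {o for o in outcomes if o[0] == o[1]}   # the two tosses agree

    def P(event):
        return sum(prob[o] for o in event)

    # Pairwise independence holds for every pair ...
    for A, B in [(X1, X2), (X1, X3), (X2, X3)]:
        assert abs(P(A & B) - P(A) * P(B)) < 1e-12

    # ... but thorough independence fails: P(X1 & X2 & X3) is 1/4, not 1/8.
    print(P(X1 & X2 & X3), P(X1) * P(X2) * P(X3))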

A set A ⊆ L of sentences of L is maximal if no proper superset of A is consistent; if A is maximal and X is any sentence not in A, then A ∪ {X} is inconsistent. If A is maximal and consistent and X is any sentence of L then either X is in A or ¬X is in A. (See modal fictionalism, section 4.5, and actualism, section 1.1.) If x is any (normal) model then the set of all sentences true in x (sometimes called the diagram of x) is maximal and consistent. Conversely every maximal consistent subset of an interpreted language (the referents of the non-logical constants being fixed) determines a unique normal model. Maximal consistent sets thus correspond with models in a clear and simple way.

The introduction touched on stronger relativized notions of necessity determined by restricting the class of models. There it was a question of arithmetic models, those in which the arithmetical signs had their standard arithmetical meaning. In the same way models may be restricted to those in which certain contingent propositions are true, and this sort of relativization is virtually imperative in discussing induction. Thus when experts on automobile safety conduct experiments to test hypotheses about the damage to cars in controlled crashes, they will assume the truth of laws about the strength of materials, not to speak of the laws of classical physics. They will not count as possible any model in which any of these laws do not obtain.

Now given an interpreted language L we can refer interchangeably to the models (perhaps restricted) of L and the maximal consistent sets (perhaps in consequence also restricted) of L. If the number of (perhaps restricted) models is finite then L is said to be a finite language. There will of course be denumerably infinitely many sentences of L, but there will be only finitely many maximal consistent classes of sentences. Thus, for example, if the models in question consist of the eight possible outcomes of three draws from an urn containing three balls, each of which is either red or black, we may represent this in a language with just one predicate, R (for red), and three individual constants a, b, and c, for the three draws. Each sequence, such as Ra ∧ ¬Rb ∧ Rc, known as a constitution, will determine a maximal consistent class or atom, namely the class of all of its logical consequences, and there will be just eight of these classes, each of which determines a model.

Probability fits naturally in this framework. In this example any assignment of non-negative numbers summing to one to the eight constitutions fixes the probability of every sentence in the language:

P(X) = ∑{αi : αi implies X} P(αi)

where αi ranges over the eight constitutions.
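
A small Python sketch of this scheme may be helpful. The particular weights below (the three draws treated as independent, with probability 2/3 of red on each draw) are one arbitrary admissible choice; any non-negative numbers summing to one would do.

    from itertools import product

    # The eight constitutions: a truth value for each of Ra, Rb, Rc.
    constitutions = list(product([True, False], repeat=3))

    # One admissible weighting: independent draws with probability 2/3 of red.
    p_red = 2/3
    def weight(c):
        return ((p_red if c[0] else 1 - p_red)
                * (p_red if c[1] else 1 - p_red)
                * (p_red if c[2] else 1 - p_red))
    P_alpha = {c: weight(c) for c in constitutions}

    def P(sentence):
        """P(X) is the sum of P(alpha_i) over the constitutions alpha_i that imply X."""
        return sum(P_alpha[c] for c in constitutions if sentence(c))

    print(P(lambda c: c[0]))                          # P(Ra) = 2/3
    print(P(lambda c: c[0] and not c[1] and c[2]))    # P(Ra & ~Rb & Rc) = 4/27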

Let R be a monadic predicate and a1, …, an distinct individual constants. We write Ri for R(ai). There is just one sequence R1 ∧ … ∧ Rn consisting entirely of R's; there are n sequences with (n−1) R's and one ¬R. In general there are

C(n, k) = n! / [k!(n−k)!] = [n(n−1)(n−2) … (n−k+1)] / [k(k−1)(k−2) … 1]

distinct sequences with k R's and (n−k) ¬R's.

Suppose that Ra1, …, Ran are thoroughly independent and equiprobable, i.e. P(Ra1) = … = P(Ran) = p. Then P(Ra1 ∧ … ∧ Ran) = p^n. And, in general, P(Ra1 ∧ … ∧ Rak ∧ ¬Rak+1 ∧ … ∧ ¬Ran) = p^k q^(n−k), where q = 1 − p. Indeed, if Ra1′, …, Ran′ result from negating any n−k of Ra1, …, Ran then P(Ra1′ ∧ … ∧ Ran′) = p^k q^(n−k).

Sequences of independent and equiprobable sentences like this are known as Bernoulli trials. They provide probabilistic descriptions of many random phenomena in which an experiment, such as tosses of a coin or draws with replacement of balls of two sorts from an urn, is repeated in uniform conditions. One outcome (red ball, heads) is typically labeled “success”. The frequency of success in such a sequence is just the number of successful outcomes. If there are k red balls and nk black balls drawn in n trials, then the frequency of success is k. The relative frequency (r.f.) of success is the frequency of success divided by the number of trials: k/n in the present example.

In the Bernoulli case the probability of k successes in n trials, in any order whatever, is

b(n, k, p) = C(n, k) p^k q^(n−k)

This is the binomial distribution, frequently encountered in works on probability and induction.
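
A direct Python transcription may be useful; the urn probability p = 2/3 is carried over from the example above, and the particular values of n and k are arbitrary.

    from math import comb

    def b(n, k, p):
        """Probability of exactly k successes in n Bernoulli trials with success probability p."""
        return comb(n, k) * p**k * (1 - p)**(n - k)

    print(b(3, 2, 2/3))                              # 3 * (2/3)**2 * (1/3) = 4/9
    print(sum(b(10, k, 2/3) for k in range(11)))     # the distribution sums to one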

Many inductive inferences concern the relations between probabilities and relative frequencies. Relative frequencies conform to the laws of probability:

r.f. of ¬R = 1 − r.f. of R
If r.f. of R ∧ S = 0 then r.f. of R ∨ S = r.f. of R + r.f. of S

In the case of Bernoulli trials we expect the relative frequency of success to be close to the probability of success. If two-thirds of the balls in the urn are red, it is plausible to identify this ratio with the probability of success and to expect that the r.f. of success will be close to 2/3. This is of course not a sure thing: the r.f. of success in actual draws may be anything from zero to one. The precise relationship between r.f. and probability is expressed in Bernoulli's weak law of large numbers. This law says that in the Bernoulli case with probability p of success, the proportion of sequences in which the r.f. of success differs from p by more than an arbitrarily small quantity ε approaches zero as the number of trials increases without bound. Symbolically, where

p = probability of success
f = frequency of success
n = number of trials
Prob = r.f. of sequences satisfying the condition

Prob[|f/n − p| > ε] approaches zero as n increases without bound. (See Cramér 1955, section 6.3 for an accessible proof.)
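
The law lends itself to a Monte Carlo illustration. In the sketch below the success probability p = 2/3, the tolerance ε = 0.05, and the number of simulated sequences are all arbitrary choices; the point is only that the estimated proportion of sequences whose relative frequency misses p by more than ε shrinks as n grows.

    import random

    def proportion_outside(n, p=2/3, eps=0.05, sequences=5000, seed=0):
        """Estimate Prob[|f/n - p| > eps] for n Bernoulli trials by simulation."""
        rng = random.Random(seed)
        bad = 0
        for _ in range(sequences):
            f = sum(rng.random() < p for _ in range(n))   # frequency of success
            if abs(f / n - p) > eps:
                bad += 1
        return bad / sequences

    for n in (10, 100, 1000):
        print(n, proportion_outside(n))   # the proportion falls toward zero as n increases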

Remarks:

Bernoulli trials are also exchangeable: the probability of a sequence of trials is invariant under permutations of their order, depending only upon the number of trials and the relative frequency of success. The converse does not hold in general: exchangeable trials need not be Bernoulli trials. (See section 5.3 of the main entry for a discussion of exchangeability and its relations with independence and equiprobability.)
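
One way to see that the converse fails (a sketch in the spirit of the de Finetti representation discussed in section 5.3; the two-component mixture below is invented for the purpose) is to toss a coin whose bias is itself uncertain. The resulting probability over sequences is exchangeable, since it depends only on the number of successes, yet the trials are not independent: an initial success raises the probability of a second.

    from math import prod

    # A hypothetical mixture: with probability 1/2 the coin has bias 1/2,
    # with probability 1/2 it has bias 3/4.
    mixture = [(0.5, 0.5), (0.5, 0.75)]          # (weight, chance of success per toss)

    def P_seq(outcomes):
        """Probability of a particular sequence of outcomes (True = success)."""
        return sum(w * prod(p if o else 1 - p for o in outcomes) for w, p in mixture)

    # Exchangeable: order does not matter, only the number of successes.
    print(P_seq([True, False]), P_seq([False, True]))            # both 0.21875

    # Not independent: P(success on trial 2 | success on trial 1) exceeds the marginal.
    print(P_seq([True, True]) / P_seq([True]), P_seq([True]))    # 0.65 versus 0.625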

A probability is symmetrical if it is invariant under permutations of individuals (or individual constants). In the case of repeated trials symmetry is equivalent to exchangeability. Symmetrical probabilities also support a number of important results. The proportional syllogism says that when probability is symmetrical the probability of success is equal to the relative frequency of success. A special case of the proportional syllogism is proved in the supplement “Two Lemmata”. Exchangeability is also the key concept in the de Finetti representation theorem, discussed in section 5.3.

Random variables and expectations, both badly named

Most uses of probability depend upon random variables and expectations. Random variables, to begin with them, are not variables; they are functions. A random variable (r.v.) is a function that assigns a quantity—positive, negative, or zero—to each member of a given collection with a given probability. The function that gives the number of pips on each side of a fair die is a r.v. that takes on the values 1–6, each with probability 1/6. The distribution of a r.v. lists its possible values and gives their probabilities. The expectation of a random variable is the weighted sum of its values, the weights being the probabilities of the respective values. Thus in the case of the fair die we write

E(N) = ∑x x · Pr[N = x] = 3.5

In this case the expectation is just the mean of the values of the r.v. If, on the other hand, the die is shaved so as to favor the ace:

Pr[N = 1] = 1/5, Pr[N = 2] = … = Pr[N = 6] = 4/25

the expectation will shift accordingly.

E(N) = 3.4

A second important characteristic of a distribution is the standard deviation. This is a measure of how far, on average, the values lie from the mean. The s.d., abbreviated ‘σ’, is the square root of the expectation of the square of the distance from the mean m:

σ = √E[(X − m)²]
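
A short Python restatement of these two quantities, applied to the fair and shaved dice above (the code simply encodes the distributions already given):

    from math import sqrt

    def expectation(dist):
        """dist maps values to probabilities; the expectation is the weighted sum of values."""
        return sum(x * p for x, p in dist.items())

    def std_dev(dist):
        """Square root of the expected squared distance from the mean."""
        m = expectation(dist)
        return sqrt(sum((x - m) ** 2 * p for x, p in dist.items()))

    fair = {x: 1/6 for x in range(1, 7)}
    shaved = {1: 1/5, **{x: 4/25 for x in range(2, 7)}}

    print(expectation(fair), std_dev(fair))       # 3.5 and about 1.71
    print(expectation(shaved), std_dev(shaved))   # 3.4 and a slightly larger spread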

Continuous random variables are those that may take on all values in some continuous interval. A r.v. is finite, by contrast, if it takes on only a finite number of distinct values. Certain families of distributions are studied for different reasons. Perhaps the best known is the normal or Gaussian distribution, the familiar bell-shaped curve. A normal distribution on an interval of the real line is easily specified by just two parameters: its mean and its variance, which is the expectation of the squared distance of its values from the mean. The bell curve actually graphs not the normal distribution, but the normal frequency function. Its equation is

φ(x) = [1 / √(2π)] e^(−x²/2)

The normal distribution function is the integral of the frequency function; it says what the probability is that the value of the r.v. will be no greater than a given value. Its equation is

Φ(x) = [1 / √(2π)] ∫_(−∞)^x e^(−t²/2) dt

A second important distribution is the binomial distribution. This is the distribution described above that gives the probability of k successes in n trials. One of the most important results in the study of induction is the de Moivre-Laplace limit theorem, which states that as n and k increase without bound the binomial distribution approaches the normal distribution as a limit.
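
A convenient numerical form of the theorem (the local approximation, stated here as an assumption for the sketch) is that for large n, b(n, k, p) is approximately φ((k − np)/σ)/σ, where σ = √(npq). The Python comparison below uses arbitrary values of n, p, and k.

    from math import comb, exp, pi, sqrt

    def b(n, k, p):
        return comb(n, k) * p**k * (1 - p)**(n - k)

    def phi(x):
        """The normal frequency function: (1 / sqrt(2*pi)) * e**(-x**2 / 2)."""
        return exp(-x**2 / 2) / sqrt(2 * pi)

    n, p = 1000, 0.5
    sigma = sqrt(n * p * (1 - p))
    for k in (480, 500, 520):
        print(k, b(n, k, p), phi((k - n * p) / sigma) / sigma)   # exact value versus normal approximation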

A few simple principles mark off conditional probability from logical implication.

  1. Logical implication is monotonic. This means (roughly, see the article on non-monotonic logic for the full story) that if a set Γ implies a sentence A and if Γ ⊆ Δ, then Δ implies A. Conditional probability is, however, non-monotonic: P(C | A ∧ B) may be equal to, greater than, or less than P(C | A). Again, A may raise the probability of C while A ∧ B lowers it (a numerical illustration follows this list).
  2. Conditional probability does not allow contraposition. P(B | A) may be equal to, greater than, or less than P(¬A | ¬B).
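
Both principles can be checked on a single roll of a fair die; the particular events below are chosen only for the illustration.

    from fractions import Fraction

    omega = set(range(1, 7))                  # outcomes of one roll of a fair die

    def P(event, given=None):
        """Probability of an event, optionally conditional on another, under the uniform distribution."""
        given = omega if given is None else given
        return Fraction(len(event & given), len(given))

    A = {2, 4, 6}     # the roll is even
    B = {1, 2, 3}     # the roll is at most 3
    C = {4, 6}        # the roll is 4 or 6

    # Non-monotonicity: A raises the probability of C, but A & B lowers it to zero.
    print(P(C), P(C, given=A), P(C, given=A & B))               # 1/3, 2/3, 0

    # No contraposition: P(C | A) differs from P(not-A | not-C).
    print(P(C, given=A), P(omega - A, given=omega - C))         # 2/3 versus 3/4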