Inductive Logic > Notes (Stanford Encyclopedia of Philosophy/Summer 2010 Edition)

Notes to Inductive Logic

1. Although enumerative inductive arguments may seem similar to what classical statisticians call estimation, they are not really the same thing. As classical statisticians are quick to point out, estimation does not use the sample to inductively support a conclusion about the whole population. Estimation is not supposed to be a kind of inductive inference. Rather, estimation is a decision strategy. The sample frequency will be within two standard deviations of the population frequency in about 95% of all samples. So, if one adopts the strategy of accepting as true the claim that the population frequency is within two standard deviations of the sample frequency, and if one uses this strategy repeatedly for various samples, one should be right about 95% of the time. We will discuss enumerative induction in much more detail later in the article.

2. Another way of understanding axiom (5) is to view it as a generalization of the deduction theorem and its converse. The deduction theorem and its converse say this: C ⊨ (B⊃A) if and only if (C·B) ⊨ A. Given axioms (1-4), axiom (5) is equivalent to the following:

5*. (1 − P_α[(B⊃A) | C]) = (1 − P_α[A | (B·C)]) · P_α[B | C].

The conditional probability P_α[A | (B·C)] completely discounts the possibility that B is false, whereas the probability of the conditional P_α[(B⊃A) | C] depends significantly on how probable B is (given C), and must approach 1 if P_α[B | C] is near 0. Rule (5*) captures how this difference between the conditional probability and the probability of a conditional works. It says that the distance below 1 of the support-strength of C for (B⊃A) equals the product of the distance below 1 of the support strength of (B·C) for A and the support strength of C for B. This makes good sense: the support of C for (B⊃A) (i.e., for (~B∨A)) is closer to 1 than the support of (B·C) for A by the multiplicative factor P_α[B | C], which reflects the degree to which C supports ~B. According to Rule (5*), then, for any fixed value of P_α[A | (B·C)] < 1, as P_α[B | C] approaches 0, P_α[(B⊃A) | C] must approach 1.
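As a quick numerical illustration of Rule (5*), here is a minimal Python sketch. The function names and the sample values are illustrative assumptions, not part of the article. The sketch evaluates (5*) and checks it against the direct computation of the support for (~B∨A) given C.

```python
def prob_of_conditional(p_a_given_bc: float, p_b_given_c: float) -> float:
    """P[(B⊃A) | C] as fixed by Rule (5*): 1 - (1 - P[A | B·C]) * P[B | C]."""
    return 1.0 - (1.0 - p_a_given_bc) * p_b_given_c


def prob_of_conditional_direct(p_a_given_bc: float, p_b_given_c: float) -> float:
    """The same quantity computed as P[~B | C] + P[A·B | C]."""
    return (1.0 - p_b_given_c) + p_a_given_bc * p_b_given_c


# Hold P[A | B·C] fixed at an illustrative 0.2 and let P[B | C] shrink:
# the probability of the conditional climbs towards 1.
for p_b in (0.9, 0.5, 0.1, 0.001):
    print(p_b, prob_of_conditional(0.2, p_b))
```

The conditional probability stays at 0.2 throughout, while the probability of the conditional is driven almost entirely by how improbable B is.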

3. This is not what is commonly referred to as countable additivity. Countable additivity requires a language in which infinitely long disjunctions are defined. It would then specify that P_α[(B₁∨B₂∨…) | C] = ∑_i P_α[B_i | C]. The present result may be derived (without appealing to countable additivity) as follows. For each distinct i and j, let C ⊨ ~(B_i·B_j); and suppose that P_α[D | C] < 1 for at least one sentence D. First notice that we have, for each i, C ⊨ (~(B_i·B_{i+1})·…·~(B_i·B_n)); so C ⊨ ~(B_i·(B_{i+1}∨…∨B_n)). Then, for each finite list of the B_i,

P_α[(B₁∨B₂∨…∨B_n) | C] = P_α[B₁ | C] + P_α[(B₂∨…∨B_n) | C] = … = ∑_{i=1}^{n} P_α[B_i | C].

By definition,

∑_i P_α[B_i | C] = lim_n ∑_{i=1}^{n} P_α[B_i | C].

So, lim_n P_α[(B₁∨ B₂∨…∨B_n) | C] = ∑_i P_α[B_i | C].

4. Here are the usual axioms when unconditional probability is taken as basic:

P_α is a function from statements to real numbers between 0 and 1 that satisfies the following rules:

if ⊨A (i.e. if A is a logical truth), then P_α[A] = 1;

if ⊨~(A·B) (i.e. if A and B are logically incompatible), then P_α[(A∨B)] = P_α[A] + P_α[B];

Definition: if P_α[B] > 0, then P_α[A | B] = P_α[(A·B)] / P_α[B].

5. Bayesians often refer to the probability of an evidence statement on a hypothesis, P[e | h·b·c], as the likelihood of the hypothesis. This can be a somewhat confusing convention since it is clearly the evidence that is made likely to whatever degree by the hypothesis. So, we will disregard the usual convention here. Also, presentations of probabilistic inductive logic often suppress c and b, and simply write ‘P[e | h]’. But c and b are important parts of the logic of the likelihoods. So we will continue to make them explicit.

6. These attempts have not been wholly satisfactory thus far, but research continues. For an illuminating discussion of the logic of direct inference and the difficulties involved in providing a formal account, see the series of papers (Levi, 1977), (Kyburg, 1978) and (Levi, 1978). Levi (1980) develops a very sophisticated approach.

Kyburg has developed a logic of statistical inference based solely on logical direct inference probabilities (Kyburg, 1974). Kyburg's logical probabilities do not satisfy the usual axioms of probability theory. The series of papers cited above compares Kyburg's approach to a kind of Bayesian inductive logic championed by Levi (e.g., in Levi, 1967).

7. This idea should not be confused with positivism. A version of positivism applied to likelihoods would hold that if two theories assign the same likelihood values to all possible evidence claims, then they are essentially the same theory, though they may be couched in different words. In short: same likelihoods implies same theory. The view suggested here, however, is not positivism, but its inverse, which should be much less controversial: different likelihoods implies different theories. That is, given that all of the relevant background knowledge is made explicit (represented in ‘b’), if two scientists disagree significantly about the likelihoods of important evidence claims on a given hypothesis, they must understand the empirical content of that hypothesis quite differently. To that extent, though they may employ the same sentences, the same syntactic expressions, they use them to express empirically distinct hypotheses.

8. Call an object grue at a given time just in case either the time is earlier than the first second of the year 2030 and the object is green or the time is not earlier than the first second of 2030 and the object is blue. Now the statement ‘All emeralds are green (at all times)’ has the same syntactic structure as ‘All emeralds are grue (at all times)’. So, if syntactic structure determines priors, then these two hypotheses should have the same prior probabilities. Indeed, both should have prior probabilities approaching 0. For, there are an infinite number of competitors of these two hypotheses, each sharing the same syntactic structure: consider the hypotheses ‘All emeralds are grue_n (at all times)’, where an object is grue_n at a given time just in case either the time is earlier than the first second of the nth day after January 1, 2030, and the object is green or the time is not earlier than the first second of the nth day after January 1, 2030, and the object is blue. A purely syntactic specification of the priors should assign all of these hypotheses the same prior probability. But these are mutually exclusive hypotheses; so their prior probabilities must sum to a value no greater than 1. The only way this can happen is for ‘All emeralds are green’ and each of its grue_n competitors to have prior probability values either equal to 0 or extremely close to it.

9. This assumption may be substantially relaxed without affecting the analysis below; we might instead only suppose that the ratios P_α[cⁿ | h_j·b]/P_α[cⁿ | h_i·b] are bounded so as not to get exceptionally far from 1. If that supposition were to fail, then the mere occurrence of the experimental conditions would count as very strong evidence for or against hypotheses — a highly implausible effect. Our analysis could include such bounded condition-ratios, but this would only add inessential complexity.

10. For example, when a new disease is discovered, a new hypothesis h_u+1 about that disease being a possible cause of patients’ symptoms is made explicit. The old catch-all was, “the symptoms are caused by some unknown disease — some disease other than h₁,…, h_u”. So the new catch-all hypothesis must now state that “the symptoms are caused by one of the remaining unknown diseases — some disease other than h₁,…, h_u, h_u+1”. And, clearly, P_α[h_K | b] = P_α[~h₁·…·~h_u | b] = P_α[~h₁·…·~h_u· (h_u+1∨~h_u+1) | b] = P_α[~h₁·…·~h_u·~h_u+1 | b] + P_α[h_u+1 | b] = P_α[h_K* | b] + P_α[h_u+1 | b]. Thus, the new hypothesis h_u+1 is “peeled off” of the old catch-all hypothesis h_K, leaving a new catch-all hypothesis h_K* with a prior probability value equal to that of the old catch-all minus the prior of the new hypothesis.

11. This claim depends, of course, on h_i being empirically distinct from each alternative h_j. I.e., there must be conditions c_k with possible outcomes o_ku on which the likelihoods differ: P[o_ku | h_i·b·c_k] ≠ P[o_ku | h_j·b·c_k]. Otherwise h_i and h_j are empirically equivalent, and no amount of evidence can support one over the other. (Did you think a confirmation theory could possibly do better, that it could somehow employ evidence to confirm the true hypothesis over empirically equivalent rivals?) If the true hypothesis has empirically equivalent rivals, then convergence just implies that the odds against the disjunction of the true hypothesis with these rivals very probably go to 0, and so the posterior probability of this disjunction goes to 1. Among empirically equivalent hypotheses the ratio of their posterior probabilities equals the ratio of their priors: P_α[h_j | b·cⁿ·eⁿ] / P_α[h_i | b·cⁿ·eⁿ] = P_α[h_j | b] / P_α[h_i | b]. So the true hypothesis will have a posterior near 1 (after evidence drives the posteriors of empirically distinguishable rivals near 0) just in case non-evidential considerations make its evidence-independent plausibility much higher than the sum of the plausibility ratings of any empirically equivalent rivals.

12. This is a good place to describe one reason for thinking that inductive support functions must be distinct from subjectivist or personalist degree-of-belief functions. Although likelihoods have a high degree of objectivity in many scientific contexts, it is difficult for belief functions to properly represent objective likelihoods. This is an aspect of the problem of old evidence.

Belief functions are supposed to provide an idealized model of belief strengths for agents. They extend the notion of ideally consistent belief to a probabilistic notion of ideally coherent belief strengths. There is no harm in this kind of idealization. It is supposed to supply a normative guide for real decision making. An agent is supposed to make decisions based on her belief-strengths about the state of the world, her belief strengths about possible consequences of actions, and her assessment of the desirability (or utility) of these consequences. But the very role that belief functions are supposed to play in decision making makes them ill-suited to inductive inferences where the likelihoods are often supposed to be objective, or at least possess inter-subjectively agreed values that represent the empirical import of hypotheses. For the purposes of decision making, degree-of-belief functions should represent the agent's belief strengths based on everything she presently knows. So, degree-of-belief likelihoods must represent how strongly the agent would believe the evidence if the hypothesis were added to everything else she presently knows. However, support-function likelihoods are supposed to represent what the hypothesis (together with explicit background and experimental conditions) says or implies about the evidence. As a result, degree-of-belief likelihoods are saddled with a version of the problem of old evidence – a problem not shared by support function likelihoods. And it turns out that the old evidence problem for likelihoods is much worse than is usually recognized.

Here is the problem. If the agent is already certain of an evidence statement e, then her belief-function likelihoods for that statement must be 1 on every hypothesis. I.e., if Q_γ is her belief function and Q_γ[e] = 1, then it follows from the axioms of probability theory that Q_γ[e | h_i·b·c] = 1, regardless of what h_i says — even if h_i implies that e is quite unlikely (given b·c). But the problem goes even deeper. It not only applies to evidence that the agent knows with certainty. It turns out that almost anything the agent learns that can change how strongly she believes e will also influence the value of her belief-function likelihood for e, because Q_γ[e | h_i·b·c] represents the agent's belief strength given everything she knows.

To see the difficulty with less-than-certain evidence, consider the following example. (I'll suppress the b and c here, as subjectivist Bayesians often do, since they will make no difference for present purposes.) A physician intends to test her patient for heart disease, h, with a treadmill test. She knows from medical studies that there is a 10% false negative rate for this test; so her belief-strength for a negative result, e, given heart disease is present, h, is Q_γ[e | h] = .10. Now, her nurse is very professional and is usually unaffected by patients’ test results. So, if asked, the physician would say her belief strength that her nurse will feel devastated, d, if the test is positive (i.e. if ~e) is around Q_γ[d | ~e] = .05. Let us suppose, as seems reasonable, that this belief-strength is independent of whether h is in fact true — i.e. Q_γ[d | ~e·h] = Q_γ[d | ~e]. The nurse then says to the physician, in a completely convincing way, “he is such a nice guy — if his test comes out positive, I'll be devastated.” The physician's new belief function (Q_γ′) likelihood for a false negative must then become Q_γ′[e | h] = Q_γ[e | h·(~e⊃d)] = .69 (since Q_γ[e | h·(~e⊃d)] = Q_γ[~e⊃d | h·e] · Q_γ[e | h] / (Q_γ[~e⊃d | h·e] · Q_γ[e | h] + Q_γ[~e⊃d | h·~e] · Q_γ[~e | h]) = Q_γ[e | h] / (Q_γ[e | h] + Q_γ[d | ~e·h] · Q_γ[~e | h]) = .1/(.1 + (.05)(.9)) = .69).
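The arithmetic in the physician example can be reproduced with a short Python sketch. The function and parameter names are illustrative assumptions. The only inputs are the two belief strengths stated in the example, together with the fact that (~e⊃d) is certain given e.

```python
def updated_likelihood(q_e_given_h: float, q_d_given_not_e_h: float) -> float:
    """Q[e | h·(~e⊃d)] by Bayes' theorem, conditioning on the learned conditional.

    Given e, the conditional (~e⊃d) is certain (probability 1).
    Given ~e, it holds exactly when d does, so its probability is Q[d | ~e·h].
    """
    numerator = 1.0 * q_e_given_h
    denominator = numerator + q_d_given_not_e_h * (1.0 - q_e_given_h)
    return numerator / denominator


# Values from the example: false negative rate .10, Q[d | ~e·h] = .05.
new_value = updated_likelihood(0.10, 0.05)
print(round(new_value, 2))  # 0.69
```

Learning the nurse's conditional, which has no bearing on the reliability of the treadmill test, moves the belief-function likelihood from .10 to about .69.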

The point is that even the most trivial knowledge of conditional (or disjunctive) claims involving e may completely upset the value of the likelihood for an agent's belief function. And an agent will almost always have some such trivial knowledge, e.g., the physician in the previous example may also learn that if the treadmill test is negative for heart disease, then, (1) the patient's worried mother will throw a party, (2) the patient's insurance company won't cover additional tests, (3) it will be the thirty-seventh negative treadmill test result she has received for a patient this year,…, etc. Updating on such conditionals can force physicians’ belief functions to deviate widely from the evidentially relevant objective, textbook values of test result likelihoods.

More generally, it can be shown that the incorporation into Q_γ of almost any kind of evidence for or against the truth of a prospective evidence claim e (even uncertain evidence for e, as may come through Jeffrey updating) completely undermines the objective or inter-subjectively agreed likelihoods that a belief function might have expressed before updating. This should be no surprise. The agent's belief function likelihoods reflect her total degree-of-belief in e, based on h together with everything else she knows about e. So the agent's present belief function may capture appropriate, public likelihoods for e only if e is completely isolated from the agent's other beliefs. And this will rarely be the case.

One Bayesian subjectivist response to this kind of problem is that the belief functions employed in scientific inductive inferences should often be “counterfactual” belief functions, which represent what the agent would believe if e were subtracted (in some suitable way) from everything else she knows (see, e.g. Howson & Urbach, 1993). However, our examples show that merely subtracting e won't do. One must also subtract any conditional statements containing e. And one must subtract any uncertain evidence for or against e as well. So the counterfactual belief function idea needs a lot of working out if it is to rescue the idea that subjectivist Bayesian belief functions can provide a viable account of the likelihoods employed by the sciences in inductive inferences.

13. To see the point more clearly, consider an example. To keep things simple, let's suppose our background b says that the chance of heads for tosses of this coin is some whole percentage between 0% and 100%. Let c say that the coin is tossed in the usual random way; let e say that the coin comes up heads; and for each r a whole fraction of 100 between 0 and 1, let h_[r] be the simple statistical hypothesis asserting that the chance of heads on each random toss of this coin is r. Now consider the composite statistical hypothesis h_[>.65], which asserts that the chance of heads on each random toss is greater than .65. From the axioms of probability we derive the following relationship: P_α[e | h_[>.65]·b] = P[e | h_[.66]·b] · P_α[h_[.66] | h_[>.65]·b] + P[e | h_[.67]·b] · P_α[h_[.67] | h_[>.65]·b] + …+ P[e | h_[1]·b] · P_α[h_[1] | h_[>.65]·b]. The issue for the likelihoodist is that the values of the terms of form P_α[h_[r] | h_[>.65]·b] are not objectively specified by the composite hypothesis h_[>.65] (together with b). But the value of the likelihood P_α[e | h_[>.65]·b] depends essentially on these non-objective factors. So it fails to possess the kind of objectivity that likelihoodists require.
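To make the dependence concrete, the following Python sketch evaluates the sum above under two different choices for the weights P_α[h_[r] | h_[>.65]·b]. Both weightings are hypothetical, chosen only for illustration. Nothing in the composite hypothesis favors one over the other, which is exactly the difficulty.

```python
from fractions import Fraction


def composite_likelihood(weights: dict) -> Fraction:
    """P[e | h_[>.65]·b] as the weighted sum of the simple likelihoods.

    `weights` maps each chance r to an unnormalized weight standing in for
    P[h_[r] | h_[>.65]·b]; the simple likelihood P[e | h_[r]·b] is just r.
    """
    total = sum(weights.values())
    return sum(r * w for r, w in weights.items()) / total


# The simple hypotheses covered by h_[>.65]: r = .66, .67, ..., 1.
chances = [Fraction(k, 100) for k in range(66, 101)]

# Hypothetical weighting 1: every h_[r] equally plausible.
uniform = {r: Fraction(1) for r in chances}

# Hypothetical weighting 2: plausibility grows with distance above .65.
skewed = {r: r - Fraction(65, 100) for r in chances}

print(float(composite_likelihood(uniform)))  # 0.83
print(float(composite_likelihood(skewed)))   # about 0.887
```

The same composite hypothesis yields a likelihood of .83 on one weighting and roughly .887 on the other.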

14. The Law of Likelihood and the Likelihood Principle have been formulated in slightly different ways by various logicians and statisticians. The Law of Likelihood was first identified by that name in Hacking (1965), and has been invoked more recently by the likelihoodist statisticians A.F.W. Edwards (1972) and R. Royall (1997). R.A. Fisher (1922) argued for the Likelihood Principle early in the 20th century, though he didn't call it that. One of the first places it is discussed under that name is (Savage, et al., 1962). It is also advocated by Edwards (1972) and Royall (1997).

15. To say that S is a random sample of population B with respect to attribute A means this: either, (1) the sample S is generated by a process that gives every member of B an equal chance of being selected into S, or (2) there is a subclass of B, call it C, from which S is generated by a process that gives every member of C an equal chance of being selected into S, where C is representative of B with respect to A in the sense that the frequency of A in C is almost precisely the same as the frequency of A in B. The idea is this. Ideally a poll of registered voters, B, should select a sample S in a way that gives every registered voter the same chance of getting into S. But that may be impractical. However, it suffices if the sample is selected from a representative subpopulation C of B, e.g., from registered voters who answered the telephone between the hours of 7 PM and 9 PM in the middle of the week. Of course, the claim that a given subpopulation C is representative is itself a hypothesis that is open to inductive support by evidence. Professional polling organizations do a lot of research to calibrate their sampling technique, to find out what sort of subpopulations C they may draw on as highly representative. For example, one way to see if registered voters who answer the phone during the evening, mid-week, are likely to constitute a representative sample is to conduct a large poll of such voters immediately after an election, when the result is known, to see how representative of the actual vote count the vote count from the subpopulation turns out to be.

16. This is a simple version of the Stable-Estimation Theorem of (Edwards, Lindman, Savage, 1963).

17. To get a better idea of the import of this theorem, let's consider some specific values. First notice that the factor r·(1−r) can never be larger than 1/2·1/2 = 1/4; and the closer r is to 1 or 0, the smaller r·(1−r) becomes. So, whatever the value of r, the factor q/(r·(1−r)/n)^½ ≥ 2·q·n^½. Thus, for any chosen value of q,

P[r−q < F[A,B∩S] < r+q | F[A,B] = r·Random[S,B,A]·Size[S] = n] ≥ 1 − 2·Φ[−2·q·n^½].

For example, if q = .05 and n = 400, then we have (for any value of r),

P[r−.05 < F[A,B∩S] < r+.05 | F[A,B] = r·Random[S,B,A]·Size[S] = 400] ≥ .95.

For n = 900 (and margin q = .05) this lower bound rises to .997:

P[r−.05 < F[A,B∩S] < r+.05 | F[A,B] = r·Random[S,B,A]·Size[S] = 900] ≥ .997.

If we are interested in a smaller margin of error q, we can keep the same sample size and find the value of the lower bound for that value of q. For example,

P[r−.03 < F[A,B∩S] < r+.03|F[A,B] = r·Random[S,B,A]·Size[S] = 900] ≥ .928.

By increasing the sample size the bound on the likelihood can be made as close to 1 as we want, for any margin q we choose. For example:

P[r−.01<F[A,B∩S] <r+.01|F[A,B] = r·Random[S,B,A]·Size[S] = 38000] ≥ .9999.

As the sample size n becomes larger, it becomes extremely likely that the sample frequency will get as close to the true frequency r as anyone may desire.
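The figures quoted in this note can be checked with a few lines of Python. This is a minimal sketch, assuming the worst-case bound 1 − 2·Φ[−2·q·n^½] and computing Φ from the standard library's error function. The function names are illustrative.

```python
import math


def std_normal_cdf(x: float) -> float:
    """Φ[x]: area under the standard normal curve from -infinity to x."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def sample_frequency_lower_bound(q: float, n: int) -> float:
    """Worst-case (r = 1/2) lower bound 1 - 2·Φ[-2·q·sqrt(n)] on the likelihood
    that the sample frequency falls within margin q of the true frequency r."""
    return 1.0 - 2.0 * std_normal_cdf(-2.0 * q * math.sqrt(n))


# The (margin, sample size) pairs used in the note.
for q, n in [(0.05, 400), (0.05, 900), (0.03, 900), (0.01, 38000)]:
    print(q, n, sample_frequency_lower_bound(q, n))
```

Because the bound uses the largest possible value of r·(1−r), it holds for every r. For values of r far from 1/2 the actual likelihood is higher still.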

18. That is, for each inductive support function P_α, the posterior P_α[h_j | b·cⁿ·eⁿ] must go to 0 if the ratio P_α[h_j | b·cⁿ·eⁿ] / P_α[h_i | b·cⁿ·eⁿ] goes to 0; and that will occur if the likelihood ratios P[eⁿ | h_j·b·cⁿ] / P[eⁿ | h_i·b·cⁿ] approach 0 and the prior P_α[h_i | b] is greater than 0. The Likelihood Ratio Convergence Theorem will show that when h_i·b is true, it is very likely that the evidence will indeed be such as to drive the likelihood ratios as near to 0 as you please, for a long enough evidence stream. If that happens, the only way a Bayesian agent can avoid having his inductive support function yield posterior probabilities for h_j approaching 0 (as n gets large) is to continually switch among support functions (moving from P_α to P_β to P_γ to …) in a way that revises the pre-evidential prior probability of h_i downward towards 0. And even then, he can only avoid having the posterior probability for h_j approach 0 for each current support function, as he switches among them, by continually switching to new support functions at a rate that keeps the revised priors P_ε[h_i | b] for h_i diminishing towards 0 at least as quickly as the likelihood ratios diminish towards 0 (with increasing n). For, suppose, on the contrary, that P[eⁿ | h_j·b·cⁿ] / P[eⁿ | h_i·b·cⁿ] approaches 0 faster than the sequence P_ε[h_i | b], for changing P_ε and increasing n; i.e., approaches 0 faster in the sense that (P[eⁿ | h_j·b·cⁿ] / P[eⁿ | h_i·b·cⁿ]) / P_ε[h_i | b] goes to 0, for changing P_ε and increasing n. Then, we have (P[eⁿ | h_j·b·cⁿ] / P[eⁿ | h_i·b·cⁿ]) / P_ε[h_i | b] > (P[eⁿ | h_j·b·cⁿ] / P[eⁿ | h_i·b·cⁿ]) · (P_ε[h_j | b] / P_ε[h_i | b]) = P_ε[h_j | b·cⁿ·eⁿ] / P_ε[h_i | b·cⁿ·eⁿ]. So, P_ε[h_j | b·cⁿ·eⁿ] / P_ε[h_i | b·cⁿ·eⁿ] must still go to 0, for changing P_ε and increasing n; and so must P_ε[h_j | b·cⁿ·eⁿ].

For a thorough presentation of the most prominent Bayesian convergence results and a discussion of their weaknesses see (Earman, 1992, Ch. 6). However, Earman does not discuss the convergence theorems under consideration here.

19. In scientific contexts the most prominent kind of case where data may fail to be result-independent is where some quantity of past data helps tie down the numerical value of a parameter not completely specified by the hypothesis at issue, but where the value of this parameter influences the likelihoods of outcomes of lots of other experiments. Such hypotheses effectively contain disjunctions of more specific hypotheses, where each distinct disjunct is a version of the original hypothesis, but with a specific value filled in for the parameter. Evidence that “fills in the value” for the parameter for the original, less specific hypothesis just amounts to evidence that refutes (via likelihood ratios) those specific disjuncts in it that possess incorrect parameter values. So, for the purposes of inductive logic, in many cases where it is helpful to treat evidence as composed of result-independent chunks one may decompose the less specific composite disjunctive hypotheses into their more specific disjuncts. For each of them, result-independence should be satisfied (when the data is appropriately chunked, as discussed in the text).

20. Technically, suppose that O_k can be further “subdivided” into more outcome-descriptions by replacing o_kv with two “parts”, o_kv^* and o_kv^#, to produce new outcome space O_k^* = {o_k1,…,o_kv^*,o_kv^#,…,o_kw}, where P[o_kv^*·o_kv^# | h_i·b·c_k] = 0 and P[o_kv^* | h_i·b·c_k] + P[o_kv^# | h_i·b·c_k] = P[o_kv | h_i·b·c_k]; and suppose similar relationships hold for h_j. Then the new EQI^* (based on O_k^*) is greater than or equal to EQI (based on O_k); and EQI^* > EQI just in case at least one of the new likelihood ratios, e.g., P[o_kv^* | h_i·b·c_k] / P[o_kv^* | h_j·b·c_k], differs in value from the “undivided” outcome's likelihood ratio, P[o_kv | h_i·b·c_k] / P[o_kv | h_j·b·c_k].
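The effect of subdividing an outcome can be illustrated numerically. The sketch below assumes that EQI for a single experiment is the expected value, under h_i, of the base-2 logarithm of the likelihood ratio of h_i over h_j. That definition comes from the main article and is not restated in these notes. The outcome probabilities are hypothetical values chosen only for illustration.

```python
import math


def eqi(p_i: list, p_j: list) -> float:
    """Expected log2 likelihood ratio (h_i over h_j), with expectation taken under h_i.

    p_i[u] and p_j[u] stand in for P[o_ku | h_i·b·c_k] and P[o_ku | h_j·b·c_k].
    """
    return sum(pi * math.log2(pi / pj) for pi, pj in zip(p_i, p_j) if pi > 0)


# Hypothetical two-outcome experiment.
eqi_base = eqi([0.5, 0.5], [0.25, 0.75])

# Split the second outcome so that the two parts have different likelihood ratios.
eqi_uneven = eqi([0.5, 0.4, 0.1], [0.25, 0.30, 0.45])

# Split it so that both parts keep the undivided outcome's likelihood ratio.
eqi_proportional = eqi([0.5, 0.25, 0.25], [0.25, 0.375, 0.375])

print(eqi_base, eqi_uneven, eqi_proportional)
```

In this example the uneven split raises EQI, while the proportional split leaves it unchanged, matching the condition stated in the note.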

21. The likely rate of convergence will almost always be much faster than the worst case bound provided by Theorem 2. To see the point more clearly, let's look at a very simple example. Suppose h_i says that a certain bent coin has a propensity for “heads” of 2/3 and h_j says the propensity is 1/3. Let the evidence stream consist of outcomes of tosses. In this case the average EQI equals the EQI of each toss, which is 1/3; and the smallest possible likelihood ratio occurs for “heads”, which yields the value γ = ½. So, the value of the lower bound given by Theorem 2 for the likelihood of getting an outcome sequence with a likelihood ratio below ε (for h_j over h_i) is

1 − (1/n)(log ½)²/((1/3) + (log ε)/n)² = 1 − 9/(n·(1 + 3(log ε)/n)²).

Thus, according to the theorem, the likelihood of getting an outcome sequence with a likelihood ratio less than ε = 1/16 (≈ .06) when h_i is true and the number of tosses is n = 52 is at least .70; and for n = 204 tosses the likelihood is at least .95.

To see how much lower than necessary the lower bound provided by the theorem really is, consider what the usual binomial distribution for the coin tosses in this example implies about the likely values of the likelihood ratios. The likelihood ratio for exactly k “heads” in n tosses is ((1/3)^k·(2/3)^(n−k)) / ((2/3)^k·(1/3)^(n−k)) = 2^(n−2k), which we want to have a value less than ε. A bit of algebra yields that to get a likelihood ratio below ε, the proportion of “heads” must be k/n > ½ − ½(log ε)/n. Using the normal approximation to the binomial distribution (with mean = 2/3 and variance = (2/3)·(1/3)/n) the actual likelihood of obtaining an outcome sequence having a likelihood ratio less than ε is given by

Φ[(mean − (½ − ½(log ε)/n))/(variance)^½] = Φ[(1/8)^½n^½(1 + 3(log ε)/n)]

(where Φ[x] gives the value of the standard normal distribution from −∞ to x). Now let ε = 1/16 (= .0625), as before. So the actual likelihood of obtaining a stream of outcomes with likelihood ratio this small when h_i is true and the number of tosses is n = 52 is Φ[1.96] > .975, whereas the lower bound given by Theorem 2 was .70. And if the number of tosses is increased to n = 204, the likelihood of obtaining an outcome sequence with a likelihood ratio this small (i.e., ε = 1/16) is Φ[4.75] > .999999, whereas the lower bound from Theorem 2 for this likelihood is .95. Indeed, to actually get a likelihood of .95 that the evidence stream will produce a likelihood ratio less than ε ≈ .06, the number of tosses actually needed is only n = 43, rather than the 204 tosses required by the bound given by the theorem. (Note: These examples employ “identically distributed” trials, repeated tosses of a coin, as an illustration. But Convergence Theorem 2 applies much more generally. It applies to any evidence sequence, no matter how diverse the probability distributions for the various experiments or observations in the sequence.)
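The comparison between the Theorem 2 bound and the normal approximation for the bent-coin example can be reproduced as follows. This is a sketch under the note's own assumptions (EQI = 1/3, γ = ½, logarithms to base 2). The function names are illustrative.

```python
import math


def std_normal_cdf(x: float) -> float:
    """Φ[x]: area under the standard normal curve from -infinity to x."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def theorem2_bound(n: int, eps: float) -> float:
    """Lower bound 1 - 9/(n·(1 + 3(log eps)/n)^2) for the bent-coin example."""
    return 1.0 - 9.0 / (n * (1.0 + 3.0 * math.log2(eps) / n) ** 2)


def normal_approx(n: int, eps: float) -> float:
    """Φ[(1/8)^½ · n^½ · (1 + 3(log eps)/n)], the normal approximation to the
    likelihood of a likelihood ratio below eps when h_i is true."""
    return std_normal_cdf(math.sqrt(n / 8.0) * (1.0 + 3.0 * math.log2(eps) / n))


def min_tosses(eps: float, target: float) -> int:
    """Smallest n at which the normal approximation reaches the target likelihood."""
    n = 1
    while normal_approx(n, eps) < target:
        n += 1
    return n


eps = 1 / 16
for n in (52, 204):
    print(n, theorem2_bound(n, eps), normal_approx(n, eps))
print(min_tosses(eps, 0.95))  # 43
```

The search in `min_tosses` relies on the approximation increasing with n, which holds here because both factors in the argument of Φ grow as n grows.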

22. It should now be clear why the boundedness of EQI above 0 is important. Convergence Theorem 2 applies only when EQI[cⁿ | h_i/h_j | b] > −(log ε)/n. But this requirement is not a strong assumption. For, the Nonnegativity of EQI Theorem shows that the empirical distinctness of two hypotheses on a single possible outcome suffices to make the average EQI positive for the whole sequence of experiments. So, given any small fraction ε > 0, the value of −(log ε)/n (which has to be greater than 0) will eventually become smaller than EQI, provided that the degree to which the hypotheses are empirically distinct for the various observations c_k does not on average degrade too much as the length n of the evidence stream increases. This seems a reasonable condition on the empirical distinctness of hypotheses. And Convergence Theorem 2 relies on it.

When the possible outcomes for the sequence of observations are independent and identically distributed, Theorems 1 and 2 essentially reduce to L. J. Savage's Bayesian Convergence Theorem [Savage, pg. 52-54]. Independent, identically distributed outcomes most commonly result from the repetition of identical statistical experiments (e.g., repeated tosses of a coin, or repeated measurements of quantum systems prepared in identical states). In such experiments a hypothesis will specify the same likelihoods for the same kinds of outcomes from one observation to the next. So EQI will remain constant as the number of experiments, n, increases. However, Theorems 1 and 2 are much more general. They continue to hold when the sequence of observations encompasses completely unrelated experiments that have different distributions on outcomes, experiments that have nothing in common except their connection to the hypotheses they test.

23. In many scientific contexts this is the best we can hope for. But it still provides a very reasonable representation of inductive support. Let's consider, for example, the hypothesis that the land masses of Africa and South America separated and drifted apart over the eons, the drift hypothesis, as opposed to the hypothesis that the continents have fixed positions acquired when the earth first formed and cooled and contracted, the contraction hypothesis. One may not be able to determine anything like precise likelihoods, on each hypothesis, that the shape of the east coast of South America should match the shape of the west coast of Africa as closely as it in fact does, or that the geology of the two coasts should match up so well, or that the plant and animal species on these distant continents should be as similar as they are. But experts may readily agree that each of these observations is much more likely on the drift hypothesis than on the contractionist hypothesis. Jointly these observations should constitute very strong evidence for drift over contraction.

Historically, the case of continental drift is more complicated. Geologists tended to largely dismiss this evidence until the 1960s. This was not because the evidence wasn't strong in its own right. Rather, this evidence was found unconvincing because it was not sufficient to overcome prior plausibility considerations that made the drift hypothesis seem extremely implausible, much less plausible than the contraction hypothesis. The problem was that there seemed to be no plausible mechanism by which drift might occur. It was argued, quite plausibly, that no known force could push or pull the continents apart, and that the less dense continental material could not push through the denser material that makes up the ocean floor. These plausibility objections were overcome when a plausible mechanism was articulated, i.e., the continental crust floats atop molten material and moves apart as convection currents in the molten material carry it along. The case was pretty well clinched when evidence for this mechanism was found in the form of “spreading zones” containing alternating strips of magnetized material at regular distances from mid-ocean ridges. The magnetic alignments of materials in these strips correspond closely to the magnetic alignments found in magnetic materials in dateable sedimentary layers at other locations on the earth. These magnetic alignments indicate time periods when the direction of earth's magnetic field has reversed. And this gave geologists a way of measuring the rate at which the sea floor might spread and the continents move apart. Although geologists may not be able to determine anything like precise values for the likelihoods of any of this evidence on each of the alternative hypotheses, the evidence is universally agreed to be much more likely on the drift hypothesis than on the contractionist alternative. And, with the emergence of the possibility of a plausible mechanism, the drift hypothesis no longer seems so overwhelmingly implausible prior to the evidence, either. So, the value of a likelihood ratio may be objective or public enough, even when precise values for individual likelihoods are not available.

24. To see the point of the last clause, suppose it were violated. That is, suppose there are possible outcomes for which the likelihood ratio is very near 1 for just one of the two support functions. Then, even a very long sequence of such outcomes might leave the likelihood ratio for one support function almost equal to 1, while the likelihood ratio for the other support function goes to an extreme value. If that can happen for support functions in a class that represent likelihoods for various scientists in the community, then the empirical content of the hypotheses is either too vague or too much in dispute for meaningful empirical evaluation to occur.

25. Even if there are a few directionally controversial likelihood ratios, where P_α says the ratio is somewhat greater than 1, while P_β assigns a value somewhat less than 1, these may not greatly affect the trend of P_α and P_β towards agreement on the refutation and support of hypotheses, provided that the controversial ratios are not so extreme as to overwhelm the stream of other evidence on which the likelihood ratios do directionally agree.