This article was downloaded by: [Stony Brook University]
On: 29 October 2014, At: 00:01
Publisher: Taylor & Francis
Informa Ltd Registered in England and Wales Registered Number: 1072954 Registered office: Mortimer House, 37-41 Mortimer Street, London W1T 3JH, UK

Journal of the American Statistical Association
Publication details, including instructions for authors and subscription information: http://www.tandfonline.com/loi/uasa20

Rejoinder
Richard Royall
Published online: 17 Feb 2012.

To cite this article: Richard Royall (2000) Rejoinder, Journal of the American Statistical Association, 95:451, 773-780, DOI: 10.1080/01621459.2000.10474269

To link to this article: http://dx.doi.org/10.1080/01621459.2000.10474269

PLEASE SCROLL DOWN FOR ARTICLE

Taylor & Francis makes every effort to ensure the accuracy of all the information (the “Content”) contained in the publications on our platform. However, Taylor & Francis, our agents, and our licensors make no representations or warranties whatsoever as to the accuracy, completeness, or suitability for any purpose of the Content. Any opinions and views expressed in this publication are the opinions and views of the authors, and are not the views of or endorsed by Taylor & Francis. The accuracy of the Content should not be relied upon and should be independently verified with primary sources of information. Taylor and Francis shall not be liable for any losses, actions, claims, proceedings, demands, costs, expenses, damages, and other liabilities whatsoever or howsoever caused arising directly or indirectly in connection with, in relation to or arising out of the use of the Content.

This article may be used for research, teaching, and private study purposes. Any substantial or systematic reproduction, redistribution, reselling, loan, sub-licensing, systematic supply, or distribution in any form to anyone is expressly forbidden. Terms & Conditions of access and use can be found at http://www.tandfonline.com/page/terms-and-conditions

Rejoinder




Comment

A. C. M. WONG

The author is to be congratulated for his interesting article. In particular, I would like to comment on three issues.

Equation (3) has only first-order convergence rate o ( ~ - ~ / ~ ) . It gives very poor approximation, especially when the sample size is small or when the underlying distri- bution is far away from normal distribution (see Pierce and Peters 1992 for some examples). Barndorff-Nielsen (1991) proposed a modified likelihood ratio statistic. Will this mod- ification improve the universal bound on the probability of misleading evidence?

Calculating Fisher information can be difficult in many situations. However, one can replace Fisher information by the observed information; that is, replace I ( 0 ) by j (@, where

and 8 is the maximum likelihood estimate of 6. If this is the case, then the maximum probability of misleading evidence given in Section 4 is achieved at the two alternatives,

Note that the margin of error remains the same regardless of the true value 61. Will the rest of the results in this article still hold?

In Section 5 the author suggested using the profile likelihood. As he noted, the profile likelihood is not a true likelihood, and thus the universal bound does not apply. Fraser and Reid (1988) and McCullagh and Tibshirani (1990) proposed two different adjustments to the profile likelihood such that the resulting adjusted likelihoods have properties closer to those of a true likelihood. This may be one way to ensure that the universal bound is applicable, but it may not be feasible, because the adjusted likelihoods are, in general, very complicated. Alternatively, Pierce and Peters (1992) discussed various ways to compensate for the loss of information due to the elimination of the nuisance parameters. In particular, a correction factor due to Skovgaard (1987) requires only changing the inverse of the information matrix j(θ, γ). This correction is very easy to apply. Will this ensure that the universal bound holds? Furthermore, the modified likelihood ratio statistic proposed by Barndorff-Nielsen (1991) can be applied, which certainly will improve the results obtained from using the profile likelihood ratio statistic.

ADDITIONAL REFERENCES

Barndorff-Nielsen, O. E. (1991), “Modified Signed Log-Likelihood Ratio,” Biometrika, 78, 557-563.

Fraser, D. A. S., and Reid, N. (1988), “On Conditional Inference for a Real Parameter: A Differential Approach on the Sample Space,” Biometrika, 75, 251-264.

McCullagh, P., and Tibshirani, R. (1990), “A Simple Method for the Adjustment of Profile Likelihoods,” Journal of the Royal Statistical Society, Ser. B, 52, 325-344.

Pierce, D. A., and Peters, D. (1992), “Practical Use of Higher-Order Asymptotics for Multiparameter Exponential Families” (with discussion), Journal of the Royal Statistical Society, Ser. B, 54, 701-737.

Skovgaard, I. M. (1987), “Saddlepoint Expansions for Conditional Distributions,” Journal of Applied Probability, 24, 875-887.

A. C. M. Wong is Associate Professor, Department of Mathematics and Statistics, York University, North York, Ontario, Canada M3J 1P3 (E-mail: august@mathstat.yorku.ca).

© 2000 American Statistical Association
Journal of the American Statistical Association
September 2000, Vol. 95, No. 451, Theory and Methods

Rejoinder

Richard ROYALL

I thank the discussants for their interest and comments. Evans observes that the article “is not necessarily an argument for the likelihood as the appropriate characterization of statistical evidence, although ultimately that is the question of real interest. It could be viewed as simply considering some consequences of such a definition.” Showing that it has reasonable consequences is, of course, an essential part of the argument for the law of likelihood, and the properties of likelihood functions exhibited in the universal bound and in the bump function provide strong support for the claim that the law of likelihood does indeed represent the appropriate characterization of statistical evidence. But Evans is certainly correct in suggesting that these results are not by themselves intended to be a compelling argument for that claim. He and the other discussants express various reservations that this article does not attempt to address in detail, but that are examined elsewhere. Because the general question of the appropriate characterization of



statistical evidence, although not the subject of the article, is central to the discipline of statistics, I welcome the opportunity to discuss the issues here.

In Statistical Evidence: A Likelihood Paradigm (1997) (hereinafter RR97), I examined well-known limitations and inconsistencies in conventional statistical theory (and methods) with respect to the concept of statistical evidence. There I presented arguments and examples showing that for the purpose of interpreting and communicating observed data as evidence (i.e., for showing what the data say), the standard statistical methods for estimation (point and interval) and hypothesis testing (including “tests of significance”) are not only lacking in theoretical support, but also are incompatible with the fundamental principle on which a sound theory of statistical evidence must rest, the law of likelihood. Although, as Forster and Sober (forthcoming) have noted, it may turn out to be impossible to prove the law of likelihood from more primitive principles, for the simple reason that the likelihood concept is “an item of rock-bottom epistemology,” in RR97 I presented three types of arguments in its support: intuitive, formal, and operational. Those arguments and analyses go far toward resolving the more general concerns expressed by the present discussants.

What Is the Question?

Most of those concerns spring from misunderstandings that are the inevitable consequence of our (statisticians’) chronic lack of clarity about what we are trying to accomplish when we engage in “statistical inference.” I am convinced that such misunderstandings are the essence of the frequentist/Bayesian controversy, and that they represent the main obstacle to both of those factions’ appreciation of the law of likelihood for what it is: the fundamental principle of statistical reasoning. Let me try to explain.

First, what I mean by “statistical evidence” has two components: observations and a probability model. The model comprises a set of probability distributions, and the observations are conceptualized as having been generated according to one of them. These two components, observations and a probability model, define an instance of statistical evidence (Birnbaum 1962), whereas the concept of statistical evidence is essentially defined by the law of likelihood, which explains how such evidence is interpreted. In RR97 I gave the following simple example of statistical evidence and used it to illustrate some critical distinctions and a basic principle.

A subject is given a diagnostic test for the presence of a disease, and the result is positive. The observed test result becomes statistical evidence when it is conceptualized as a realization of a random variable whose probability distribution is unknown. In this example, one supposes that the random variable has one of two specified probability distributions, depending on whether the subject has the disease. In particular, if the disease is present, then the probability of a positive test result is .94, but if it is not, then the probability is only .02.

The subject’s physician might ask three important questions about this statistical evidence:

1. Should this observation lead me to believe that this person has the disease?
2. Does this observation justify my acting as if he has the disease?
3. Is this test result evidence that he has the disease?

These are examples of the generic questions that define three distinct problem-areas of statistics:

1. What should I believe? (Bayesian inference)
2. What should I do? (Decision theory)
3. What do these data say? (Evidential analysis)

For the answers to the physician’s questions, we look to the principles and theory of these problem-areas. The answers turn out to be

1. Maybe.
2. Maybe.
3. Yes.

Bayesian inference explains that to answer question 1, the statistical evidence must be combined with the prior state of uncertainty about this subject’s disease status. For example, if the test is used in a mass screening program, and the subject is simply one of the positives, then the answer to question 1 is “yes” if the disease prevalence is high enough and “no” if it is not.
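The dependence of question 1’s answer on prevalence is easy to make concrete. A small computational sketch (not part of the original rejoinder; the prevalence values are illustrative assumptions):

```python
# P(disease | positive test) by Bayes' rule, for the test with
# sensitivity .94 and false-positive rate .02 (likelihood ratio 47).
# The prevalence values below are illustrative assumptions.

def posterior_disease(prevalence, p_pos_disease=0.94, p_pos_healthy=0.02):
    joint_d = prevalence * p_pos_disease
    joint_h = (1 - prevalence) * p_pos_healthy
    return joint_d / (joint_d + joint_h)

for prev in (0.001, 0.01, 0.1, 0.5):
    print(f"prevalence {prev}: P(disease | +) = {posterior_disease(prev):.3f}")
```

The same positive result leaves the posterior probability of disease below .05 when the prevalence is .001 and pushes it above .97 when the prevalence is .5, which is exactly why question 1 cannot be answered from the statistical evidence alone.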

Thus question 1 cannot be answered in terms of the statistical evidence alone. It requires additional input describing the state of uncertainty about the disease before the statistical evidence was observed. Similarly, decision theory explains the critical role of the loss function in determining the answer to question 2, which again cannot be answered in terms of the statistical evidence alone.

In contrast, the law of likelihood says that under this probability model (asserting that the probability of a positive test is either .94 or .02, depending on whether or not the disease is present), the positive test result is evidence that the subject has the disease, regardless of the probability of disease before the test, regardless of what actions might be contemplated, and regardless of the costs of such actions.

Thus, depending on factors that are independent of this statistical evidence (such as prior probabilities and loss structures), it might be correct to conclude that it is improbable that the subject has the disease, and it might be best not to treat him for the disease. But under the specified probability model, it is never right to interpret the positive test result as evidence that the subject does not have the disease. Why? Because that would violate the fundamental principle of statistical reasoning, the principle that governs the interpretation of statistical evidence per se: the law of likelihood.

This simple example serves to define the important statistical problem area of evidential analysis, and to distinguish it from other important problem areas. It provides the answer to Kalbfleisch’s question of “what statistical problems can be addressed?” The likelihood paradigm addresses problems in the area defined by question 3, problems of interpreting data as evidence. The theory behind the standard frequentist statistical methods, such as those for hypothesis testing and interval estimation, is based largely on the Neyman-Pearson paradigm, which explicitly formulates the corresponding problems in terms of behavior or decision making, so that in theory those methods address question 2. But the methods are typically used in applications aimed at question 3 (RR97, chaps. 2 and 3). Thus the problems addressed by the likelihood paradigm are problems not addressed by conventional statistical theory, but where statistical methods nevertheless find their most important scientific applications; that is, problems requiring the objective interpretation of empirical data as evidence. As Birnbaum (1962) put it, “when informative inference [‘evidential analysis’] is recognized as a distinct problem-area of mathematical statistics, it is seen to have a scope including some of the . . . applications customarily subsumed in the problem-areas of point or set estimation, testing hypotheses, and multi-decision procedures.”

Because the law of likelihood is incompatible with the theory behind standard statistical methods, it provides only indirect guidance for their use. It suggests that when frequentist methods are used for the purpose of interpreting data as evidence, those based on likelihood functions (such as likelihood-based confidence intervals and maximum likelihood estimates) are apt to appear more “reasonable” or “appropriate” than others. But the law makes it clear (via the likelihood principle) that for this purpose, conventional frequentist methods, including those that are likelihood-based, are not only lacking in theoretical support; they are quite generally inappropriate (RR97, secs. 2.3 and 3.7). Because Kalbfleisch’s focus is on the potential role of likelihood functions within the current frequentist paradigm (for example, he expresses concern that likelihood intervals are not confidence intervals, and that the profile likelihood does not provide “a good method for estimation and testing” in certain well-known circumstances), his disappointment is understandable.

The Strength of Statistical Evidence

Although no one seems to dispute the interpretation of the positive diagnostic test result, where the likelihood ratio is P(+ | disease)/P(+ | no disease) = 47, as strong evidence in favor of the disease hypothesis, many express doubts about whether a likelihood ratio of 47 really represents evidence of the same strength in different problems, as the law of likelihood says it does. The concern is that the likelihood ratio, although perhaps representing an important aspect of the statistical evidence, does not tell the whole story. (Both Evans and Kalbfleisch express reservations of this sort.) These doubts are usually assuaged by the following observation: The definition of conditional probability implies that a likelihood ratio of 47 supporting A over B represents evidence precisely strong enough to increase the probability ratio of A to B by a factor of 47, regardless of what A and B represent, and regardless of their initial probabilities:

P(A | x)/P(B | x) = [P(x | A)/P(x | B)] × [P(A)/P(B)] = 47 × [P(A)/P(B)].

This has nothing to do with the relative complexities (or dimensions) of A and B. Thus the short answer to Kalbfleisch’s query “is it envisaged that the choice of k is dimension dependent?” is “no.”
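The odds-multiplying property can be checked numerically. A sketch (not from the original text) using the diagnostic-test conditionals, with illustrative prior probabilities:

```python
# Check numerically that a likelihood ratio of 47 multiplies the
# probability ratio (odds) of A over B by 47, whatever the priors.
# The conditionals .94 and .02 are those of the diagnostic test.

def odds_factor(p_a, p_x_a=0.94, p_x_b=0.02):
    """Ratio of posterior odds to prior odds of A over B, given x."""
    p_b = 1 - p_a
    post_a = p_x_a * p_a / (p_x_a * p_a + p_x_b * p_b)
    post_b = 1 - post_a
    return (post_a / post_b) / (p_a / p_b)

for p_a in (0.001, 0.2, 0.5, 0.9):
    print(odds_factor(p_a))   # always 47, up to rounding
```

The factor is 47 for every prior, which is the sense in which the likelihood ratio has "nothing to do with" the nature or probability of A and B.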

The Role of Probabilities of Weak and Misleading Evidence

Suspicion that the likelihood ratio does not tell the whole story, though perhaps diminished by the preceding argument, persists. One intractable source of this suspicion is the belief that probabilities such as p values measure the strength of statistical evidence. Of course, this belief is deeply ingrained in current statistical thinking, but it is, I believe, quite clearly wrong (RR97, chap. 3). It is illustrated by Evans’s suggestion that, in addition to the observed likelihood ratio, the probability of observing a misleading likelihood ratio that large or larger [his expression (1)] should be “part of the summary of statistical evidence.”

To see what is wrong with this suggestion, consider another diagnostic test that is much less sensitive than the first one. For this test, the probability of a positive result is only .47 if the subject has the disease, and .01 if he does not. (For the first test, the corresponding probabilities were .94 and .02.) A positive test again gives a likelihood ratio of 47 in favor of disease. Now if the subject does not have the disease, then the probability of a (misleading) likelihood ratio this large in favor of disease is .01, only half of the value for the first test. Does this mean that a positive result on the second test is stronger evidence in favor of disease than a positive result on the first one, or that a positive result on the second test is “less likely to be misleading,” is “more reliable” in some sense, or that it warrants more “confidence”? It does not.

A positive result on the second test is equivalent, as evidence about presence or absence of disease, to a positive result on the first. Whatever the probability ratio before testing, P(disease)/P(no disease), the effect of a positive result on either test is exactly the same, to increase it by a factor of 47. In any decision problem where the losses depend on whether or not the disease is present, the effect of a positive result on either test is exactly the same.

Before testing, it is true that the first test has the greater probability of producing misleading evidence of strength 47 in favor of disease (.02 vs. .01). But after testing, those two probabilities are irrelevant. An observed positive result is misleading evidence if and only if the subject does not have the disease, and the probability that it is misleading is P(no disease | +), which is the same for both tests. That is, an observed positive result is no more likely to be misleading if it comes from one test than if it comes from the other. Both tests’ positive results are equally reliable, and both are worthy of the same degree of confidence that they do not represent misleading evidence. (For closer examination of this sort of example, see RR97, sec. 4.5.)
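That the two tests’ positive results are equally reliable can be checked directly. In this sketch (not from the original text) the pre-test probability of .1 is an illustrative assumption:

```python
# Two tests with the same likelihood ratio (47): sensitivities .94 and
# .47, false-positive rates .02 and .01. Given the same prior, both
# yield the same probability that a positive result is misleading.

def p_no_disease_given_pos(prior, sens, p_pos_healthy):
    """P(no disease | +), i.e., the chance a positive result misleads."""
    p_pos = prior * sens + (1 - prior) * p_pos_healthy
    return (1 - prior) * p_pos_healthy / p_pos

prior = 0.1   # illustrative pre-test probability of disease
test1 = p_no_disease_given_pos(prior, 0.94, 0.02)
test2 = p_no_disease_given_pos(prior, 0.47, 0.01)
print(test1, test2)   # identical for the two tests
```

The two printed probabilities coincide: the pre-test probabilities of producing misleading evidence (.02 vs. .01) differ, but the post-test probability that an observed positive is misleading does not.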



The probabilities of weak and of misleading evidence are certainly relevant to the planning of studies. In particular, explicit consideration of the probability of weak evidence, as a function of sample size, should be a routine part of the planning process. But after the experiment is completed, these probabilities are not relevant for interpreting the results. Evans observes that the article “stops short of making a recommendation of what one should report beyond the ratio in an application.” To represent the evidence, in the context of a specific working model, what one should report are all of the likelihood ratios; that is, the likelihood function. As Birnbaum (1962) put it, the likelihood principle implies that “reports of experimental results in scientific journals should in principle be descriptions of likelihood functions.” To see what the data say, one looks at the likelihood function.

Verbal Description of the Strength of Statistical Evidence

If one accepts the point that a likelihood ratio of 47 represents evidence of the same strength in different applications, then the question arises of how to characterize that strength verbally. Is this evidence reasonably described as “weak” or “very strong”? Because the strength is the same in all models, one can learn what language is appropriate by considering simple ones.

My suggested benchmarks of 8 and 32 (a likelihood ratio of 8 is “fairly strong” and a ratio of 32 is “strong” evidence) come from an urn experiment. There are two urns, one containing only white balls and one containing equal numbers of white and black balls. A ball is drawn from one of the urns, and it is observed to be white. This is obviously “weak” statistical evidence in favor of the all-white urn. Its likelihood ratio is 2. Suppose that another white ball is drawn, then another, and then another. Most people view the evidence as “fairly strong” after the third consecutive white ball (LR = 8) and as clearly “strong” after the fifth (LR = 32). A likelihood ratio of 47 is stronger evidence than 5 consecutive white balls, but not as strong as 6.
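The urn benchmarks follow from the fact that each consecutive white ball doubles the likelihood ratio. A minimal sketch (not from the original text):

```python
# Urn experiment: all-white urn vs. half-white urn. Each consecutive
# white ball multiplies the likelihood ratio by 2, so n white balls
# give a likelihood ratio of 2**n in favor of the all-white urn.

def urn_likelihood_ratio(n_white):
    return (1.0 ** n_white) / (0.5 ** n_white)

for n in range(1, 7):
    print(n, urn_likelihood_ratio(n))
# LR = 8 at n = 3 ("fairly strong"), 32 at n = 5 ("strong");
# a ratio of 47 falls between n = 5 and n = 6.
```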

The likelihood ratio measures the strength of statistical evidence on a continuum. That strength does not jump into a different category when the likelihood ratio crosses some numerical threshold. A likelihood ratio of 8 is fairly strong evidence. So are likelihood ratios of 7.5 and 10. Because for the normal distribution, the likelihood ratio supporting the sample mean, μ = x̄, over the value at the end of the two-sided 95% confidence interval, μ = x̄ + 1.96σ/√n, is 6.83, most statisticians would also describe a likelihood ratio of that size as “fairly strong” evidence. (But if they characterize the evidence supporting the sample mean over the value at the end of the one-sided 95% interval, μ = x̄ + 1.645σ/√n, as “fairly strong” evidence, then they might be exaggerating; the likelihood ratio is only 3.9, the same strength as 2 white balls in the urn scheme.)
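The two likelihood ratios quoted here come from the standard formula exp(z²/2) for a normal mean with known σ, evaluated at the interval endpoint z standard errors from the sample mean. A quick check (a sketch, not part of the original text):

```python
from math import exp

# Likelihood ratio for a normal mean (known sigma) supporting mu = xbar
# over the value z standard errors away: exp(z**2 / 2).

def lr_at_endpoint(z):
    return exp(z * z / 2)

print(round(lr_at_endpoint(1.96), 2))    # two-sided 95% endpoint
print(round(lr_at_endpoint(1.645), 2))   # one-sided 95% endpoint
```

The first value rounds to 6.83 and the second to roughly 3.9, matching the figures in the text.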

These benchmarks provide a useful, if rough, way of calibrating likelihood ratios. Attempts at much greater refinement will probably prove futile. And efforts to find the smallest threshold at which it is appropriate to describe the evidence as “strong” are positively misguided; we are not engaged in hypothesis testing here. (Evans’s concern about “ambiguity surrounding the choice of a particular k as the cutoff for saying that strong evidence does or does not exist for H1 over H2” suggests that his thinking is firmly locked in the Neyman-Pearson paradigm.) See RR97, chapters 2 and 3.

“Disproving” the Law of Likelihood: Reductio ad Absurdum

Another simple example, which some have interpreted as showing that the law of likelihood can lead to absurd conclusions, provides a framework for examining most of the discussants’ other reservations and objections. Suppose that I shuffle an ordinary-looking deck of playing cards, turn over the top card, and see that it is a four of clubs. Now consider two hypotheses, one stating that the deck is a normal one (HN), and the other stating that it consists of 52 fours of clubs (H1). The normal-deck hypothesis implies that the observation has probability 1/52, whereas H1 implies that the probability is 1. Thus according to the law of likelihood, this observation constitutes evidence in favor of the hypothesis that the deck comprises 52 fours of clubs over the hypothesis that it is a normal deck. This evidence, with a likelihood ratio of 52, is actually slightly stronger than that in the positive diagnostic test, whose likelihood ratio was 47.

Two interpretations of this example are as follows:

a. The hypothesis H1 is a ridiculous one that was made up just to fit this observation. Although the law of likelihood says that this is strong evidence supporting H1 over HN, no rational person would be convinced that there are 52 fours of clubs, or be moved to act as if that were true. Therefore, this example disproves the law of likelihood (reductio ad absurdum).

b. If the deck is a normal one, then this observation represents misleading evidence. But the same is true of every possible observation: we can always construct a “trick deck” hypothesis like H1 (52 cards identical to the one observed) that is better supported than HN by a factor of 52. Therefore, the probability of observing misleading evidence of strength 52 or greater is not constrained by the “universal bound,” 1/52, but equals 1!

Let us see why these are in fact misinterpretations:

a. Hypothesis H1 is not ridiculous. It is just vastly less probable than HN, in light of most people’s experience. (Contrary to Kalbfleisch’s impression stated in his third comment, the hypotheses need not be prespecified. The fact that H1 was not prespecified, but suggested by the data, might indicate that its prior probability is very small. However, that does not preclude measuring the statistical evidence supporting H1 over HN, and it has no effect on the strength of that evidence.) The fact that one is not persuaded that H1 is true, and might reasonably continue to act as if the deck is normal, does not mean that the observation is not strong evidence in favor of H1 over HN. It simply shows that this evidence is not strong enough to



overcome the extreme improbability of H1 relative to HN a priori. The basic error here is in confusing the three questions from the first example. There is no inconsistency in finding that (a) the data are strong evidence supporting H1 over HN, (b) one still believes HN to be more probable than H1, and (c) if one had to choose one from a collection of hypotheses about the composition of the deck, with specific penalties for erroneous choices, then one’s best choice would be HN.

A statement of the form “The data support the inference that the deck comprises 52 fours of clubs” both blurs the distinction between the three basic questions and fails to recognize the relativity of support. Because it can easily be interpreted as answering any or all of the three questions, which can have different answers, it creates the illusion of inconsistency. His use of such statements leads Wolpert to unfounded judgments regarding “sensible inference.”

I used this example in the classroom last year, actually drawing a four of clubs from a well-shuffled deck. When everyone had agreed to interpretation a (H1 is ridiculous), I reshuffled and again drew a four of clubs. After the same thing happened on the third and fourth draws, no one was surprised when I turned the deck over and revealed that there were indeed 52 fours of clubs.

b. The concept of evidence embodied in the law of likelihood is a relative one. Observations are not evidence for or against a single hypothesis; rather, they are evidence supporting one hypothesis vis-à-vis another. Likewise, whether observations do or do not represent misleading evidence is defined only with respect to pairs of hypotheses. When HN is true, the four of clubs is strong misleading evidence supporting H1 over HN. But the same observation is also overwhelming evidence supporting HN over each of the 51 other (false) trick-deck hypotheses (H2: 52 aces of hearts, H3: 52 deuces of hearts, etc.). Is it misleading evidence? That depends on the alternative being considered. It is misleading evidence with respect to H1 and HN. It is not misleading evidence with respect to H2 and HN. It is not misleading evidence with respect to H3 and HN. It is not “misleading evidence against HN,” and it is not “misleading evidence.” I see no justification for Kalbfleisch’s concern (his comment 3) about misinterpretation of the terms: if the likelihood ratio is near one, then the evidence vis-à-vis that pair of distributions is weak (regardless of whether either was prespecified). And if the likelihood ratio is 52, then the evidence is strong.
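The classroom demonstration above can be given a quantitative gloss: each consecutive four of clubs multiplies the support for H1 by 52, eventually overwhelming even an extremely small prior. In this sketch (not part of the original rejoinder) the prior of 10⁻⁶ on H1 is an illustrative assumption:

```python
# Posterior probability of H1 (52 fours of clubs) vs. HN (normal deck)
# after k consecutive fours of clubs. The prior of 1e-6 on H1 is an
# illustrative assumption, standing in for "vastly less probable."

def posterior_trick(k, prior_h1=1e-6):
    lr = 52.0 ** k                           # support multiplies by 52 per draw
    odds = lr * prior_h1 / (1 - prior_h1)    # posterior odds of H1 over HN
    return odds / (1 + odds)

for k in range(1, 5):
    print(k, posterior_trick(k))
# One draw leaves H1 very improbable; by the fourth consecutive
# four of clubs, H1 has become far more probable than HN.
```

This separates the questions cleanly: the strength of each item of evidence (a factor of 52) never changes, while belief shifts only as the accumulated evidence overcomes the prior.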

The most common objections to evidential analysis based on the law of likelihood arise from misinterpretations of the two types just seen:

a. Failing to distinguish between the statistical evidence and prior beliefs or other factors that affect the plausibility of the hypotheses or the viability of decision strategies, independently of the statistical evidence (RR97, sec. 1.5). (The three questions, again.)

b. Mistakenly characterizing data as “evidence against H” whenever some alternative H1 can be found that is better supported than H . . . or characterizing such data as “evidence that H1 is true.”

A Treasure Hunt in Flatland

This charming example cited by Evans illustrates interpretation a. It involves a scenario describing how a random path determines where a treasure is to be buried. The path to the treasure is traced by a string, but after the treasure is buried, the end of the string is moved, again randomly. As the problem was originally formulated by Stone (1976), the final position of the string’s end is observed, and this observation produces a flat likelihood function on the four treasure sites that are compatible with the observation (indicating that each of the four is equally well supported). But Stone gave a rule for choosing one of the four sites at which to dig, and showed that one who applies it will find the treasure three times out of four! The standard Bayesian resolution of this paradox (Berger and Wolpert 1988) entails noting that the scenario conveys more information than is represented in the flat likelihood function. When that additional information is expressed in a prior probability distribution on paths to the treasure, the posterior probabilities accord with Stone’s rule for digging. A “pure likelihood” solution treats the location of the treasure, rather than the entire path to it, as the parameter of interest. Under this formulation, the string’s observed path gives a likelihood that is not flat; the location chosen by Stone’s rule is the one best supported by the observation. It is better supported by a ratio of 9 to 1 over each of the other three, so that equal prior probabilities on the four sites are converted to posterior probabilities of (3/4, 1/12, 1/12, 1/12). The problem given in the verbal scenario can be formulated in various ways. One way overlooks some of the information and produces a paradox. Others avoid the paradox by including that information, either in the form of a prior probability distribution or as part of the statistical evidence.
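The conversion of equal priors into the posteriors (3/4, 1/12, 1/12, 1/12) under 9:1:1:1 support is a one-line Bayes calculation. A minimal sketch (not from the original text):

```python
# Equal priors on the four candidate sites; likelihoods in ratio
# 9:1:1:1 for the site picked by Stone's rule vs. the other three.

def posteriors(likelihoods):
    n = len(likelihoods)
    joint = [lik * (1.0 / n) for lik in likelihoods]
    total = sum(joint)
    return [j / total for j in joint]

print(posteriors([9, 1, 1, 1]))   # [0.75, 0.0833..., 0.0833..., 0.0833...]
```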

Composite Hypotheses

The example of one card drawn from a deck has a third common misinterpretation that is a variation on the second one:

b′. As an alternative to HN (normal deck), consider HA, the composite hypothesis asserting that the deck is one of the 52 trick ones. If the deck is normal, every possible observation gives a likelihood ratio of 52 in favor of HA. Therefore, when HN is true the probability of observing misleading evidence supporting HA over HN by a factor of 52 or greater is 1.

778 Journal of the American Statistical Association, September 2000

What is wrong with this interpretation is the statement “If the deck is normal, then every possible observation gives a likelihood ratio of 52 in favor of HA.” Under the “normal deck” hypothesis, every possible observation, such as a four of clubs, has a probability of 1/52. But what is the probability under HA? It is indeterminate: HA implies only that the probability is either 0 or 1. There is no likelihood ratio that measures the support for HN vis-à-vis HA. One of the distributions that make up the composite HA is better supported than HN, but HN is much better supported than any of the rest. To the question “Which of the two hypotheses, HN or HA, is better supported by this observation?,” the law of likelihood is silent.

This illustrates the fact that we can compare support for composite hypotheses only under very special circumstances (such as when every distribution in one hypothesis is better supported than any distribution in the other). In a multiparameter model, this means that there is no likelihood function for a single parameter. This is why, as I stated in the article, “the problem of how to represent evidence in the presence of nuisance parameters . . . has no general solution.”

Both Evans and Wolpert point to this conclusion and cite it as a shortcoming of the likelihood paradigm. Both assert that Bayesian methodology overcomes this limitation. Thus it seems that Bayesian methods accomplish what the law of likelihood says cannot be done: they provide a valid way to measure the support for composite hypotheses. What neither discussant acknowledges is that this is accomplished by simply changing the problem so that the law of likelihood applies!

Bayesian methods do not solve the problem of measuring support for composite hypotheses. They replace the composite hypotheses by simple ones, then measure the support for them according to the law of likelihood; that is, by likelihood ratios, which in this context are called “Bayes factors.”

This is how it works in the card example. (We take a closer look at the Bayesian solution to the nuisance parameter problem in the next section.) We have seen that there is no solution to the problem of properly characterizing the observation as evidence supporting HN versus HA and measuring the strength of that evidence. The Bayesian reaction is a pragmatic one: to replace the unsolvable problem by one that we can solve. A Bayesian might replace the composite HA with a different alternative, HB, stating not only that the deck used in our demonstration is one of the 52 trick ones but also that if it is, then each of the 52 has the same prior probability of being the one that is used.

This new alternative, HB, determines a unique probability distribution for the observable random variable: a simple hypothesis. Whereas HA implies only that the probability of observing a four of clubs is either 0 or 1, HB implies that the probability is precisely 1/52:

P(4♣ | HB) = P(4♣ | 52 4♣’s) P(52 4♣’s | HB) = 1 × 1/52 = 1/52.

Now the law of likelihood applies. The evidence supporting HB over HN is measured by the likelihood ratio, P(4♣ | HB)/P(4♣ | HN) = 1, which means that the observation represents evidence with no strength at all (neutral evidence) for HB vis-à-vis HN. This is clearly correct, because with respect to this experiment, the hypotheses HB and HN are indistinguishable. They imply the same probability distribution for the observable random variable. They are empirically equivalent hypotheses.
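A minimal numerical check of this argument, assuming the usual formulation in which the i-th trick deck consists of 52 copies of card i (the index chosen for the four of clubs below is arbitrary):

```python
# P(4 of clubs | HB) averages P(obs | trick deck i) over the 52 trick decks,
# each with prior 1/52 given HB; only the all-4-of-clubs deck contributes.
from fractions import Fraction

N_CARDS = 52
observed = 3  # hypothetical index for the four of clubs; any index works the same

# HN: normal deck.
p_obs_given_HN = Fraction(1, N_CARDS)

# HB: P(obs | trick deck i) is 1 if deck i is 52 copies of the observed card, else 0.
p_obs_given_HB = sum(
    Fraction(1, N_CARDS) * (1 if deck == observed else 0) for deck in range(N_CARDS)
)

likelihood_ratio = p_obs_given_HB / p_obs_given_HN
print(p_obs_given_HB, likelihood_ratio)  # prints: 1/52 1
```

The likelihood ratio of 1 is the “neutral evidence” of the text: HB and HN assign the same probability to every possible observation.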

Nuisance Parameters

We have used the card example to show that there is no cogent general solution to the problem of quantifying evidence supporting H1 vis-à-vis H2 when either hypothesis is composite. In the case of a vector parameter (θ, γ), this means that in general there is no likelihood function for θ alone. If there were, then its ratio L(θ1)/L(θ2) would measure the support for the hypotheses H1: θ = θ1 versus H2: θ = θ2, which, because γ is variable, are both composite. But there are some special situations where this “nuisance parameter problem” does have a solution, and it is instructive to examine them.

When the likelihood function factors, L(θ, γ) ∝ L1(θ)L2(γ), the parameters θ and γ are orthogonal, and the first factor, L1(θ), is the likelihood function for θ (RR97, chap. 7). The universal bound on the probability of misleading evidence applies to L1(θ), and the ratio L1(θ1)/L1(θ2) is the factor by which the evidence changes the prior probability ratio of θ1 to θ2 (as long as the prior does not stipulate that θ and γ are dependent). If the parameters are not orthogonal, but the model can be reparameterized in terms of θ and some other nuisance parameter, γ*, which is orthogonal to θ, L(θ, γ*) ∝ L1(θ)L2(γ*), then L1(θ) is the likelihood function for θ. An important but often overlooked advantage of the profile likelihood function is that in cases like this, where the parameters are orthogonalizable, L1(θ) is the profile likelihood. That is, whenever the nuisance parameter problem can be definitively solved, the profile likelihood function is the solution. (Because the profile likelihood is invariant under reparameterization, it is not necessary to actually find an orthogonal reparameterization (θ, γ) → (θ, γ*). As long as one exists, the profile likelihood Lp(θ) = L(θ, γ̂(θ)) automatically gives the solution, L1(θ) (Tsou and Royall 1995).)
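The factorization argument can be illustrated numerically. The toy model below is my own choice, not an example from the article: two independent Poisson counts, x with mean θ and y with nuisance mean γ, so that L(θ, γ) = L1(θ)L2(γ). Profiling out γ then returns L1(θ) up to a single multiplicative constant:

```python
# In a factored likelihood, max over gamma of L1(theta)*L2(gamma) is
# L1(theta) * max L2, so the profile likelihood is proportional to L1.
import math

x, y = 7, 3  # hypothetical observed counts

def L(theta, gamma):
    return math.exp(-theta) * theta**x * math.exp(-gamma) * gamma**y

def L1(theta):
    return math.exp(-theta) * theta**x

gammas = [0.1 * k for k in range(1, 200)]  # grid over the nuisance parameter
thetas = [2.0, 5.0, 7.0, 12.0]

profile = {t: max(L(t, g) for g in gammas) for t in thetas}
ratios = [profile[t] / L1(t) for t in thetas]

print(ratios)  # numerically constant in theta: the profile is L1 up to one constant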

On the other hand, every nuisance parameter problem, orthogonalizable or not, has not one but many Bayesian “solutions.” [Bayesians have invested enormous effort in trying to represent statistical evidence with a posterior probability distribution (instead of a likelihood function), and their work on eliminating nuisance parameters has taken place largely within that context. Much less attention has gone into the present problem, where the likelihood function is considered to be not just a means to a posterior probability distribution, but a meaningful end in itself.] Every prior probability density g(θ, γ) gives a likelihood function for θ: LB(θ; g) ∝ fg(x; θ) = ∫ f(x; θ, γ)g(γ|θ) dγ. How can we determine which, if any, of these Bayesian “solutions” are valid? The subjective Bayesian answer is “If your personal prior beliefs about the parameters are accurately represented by g(θ, γ), then LB(θ; g) is valid for you.” Science, looking to statistics for objective ways to represent and quantify empirical evidence, has not been satisfied with this answer, and other answers have been proposed. Some Bayesians, seeking an answer that is not an explicit function of personal beliefs, have tried in vain to justify choosing one


particular prior or another. [“Ignorance” and “noninformative” priors were in vogue for a while. Now interest has turned to “reference” priors (Berger and Bernardo 1992).] With no cogent theory to guide their choice of a prior, many Bayesians have taken the course of Wolpert, using whatever prior gives an answer they judge to be “sensible.”

For a model where the parameters are orthogonalizable we do not have to rely on personal judgments of this sort. We can determine whether a particular Bayesian solution, LB(θ; g), is a valid representation of the evidence about θ under the original model: it is if and only if LB(θ; g) is the profile likelihood. And this occurs only when the prior distribution g(θ, γ) implies that the orthogonal parameters, θ and γ*, are independent.

Profile likelihoods do not represent a general solution to the unsolvable problem in evidential analysis that is presented by nuisance parameters, nor do Bayesian integrated likelihoods. Both deserve serious consideration as ad hoc solutions in models that are not orthogonalizable. But if real likelihoods (such as marginal or conditional ones) that are free of nuisance parameters can be found, they are likely to be better. (Familiar examples are the likelihood for a normal variance based on the marginal chi-squared distribution of the sample variance, and the likelihood for the odds ratio in a two-binomial model based on the conditional distribution given the total number of successes.) One reason for this is that all real likelihoods satisfy the universal bound on the probability of misleading evidence.
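The universal bound for a true likelihood, P(L2/L1 ≥ k) ≤ 1/k when H1 is true, can be checked by simulation. The sketch below uses a unit-variance normal mean; the particular values of θ1, θ2, n, k, and the seed are my own illustrative choices:

```python
# Monte Carlo check of the universal bound on misleading evidence for a
# true likelihood: X1..Xn iid N(theta1, 1), likelihood ratio for theta2
# versus theta1.
import math
import random

random.seed(1)
theta1, theta2, n, k = 0.0, 1.0, 10, 8
trials = 20000

def log_lr(xs):
    # log likelihood ratio for theta2 vs theta1 under the unit-variance normal
    return sum((x - theta1) ** 2 - (x - theta2) ** 2 for x in xs) / 2.0

hits = sum(
    1
    for _ in range(trials)
    if log_lr([random.gauss(theta1, 1.0) for _ in range(n)]) >= math.log(k)
)

p_misleading = hits / trials
print(p_misleading, "<=", 1 / k)  # well inside the universal bound
```

For these settings the misleading-evidence probability is far below 1/k; the point of the surrounding text is that integrated likelihoods, which are not true likelihoods under the original model, carry no such guarantee.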

When X is generated from a distribution in the original model, a Bayesian integrated likelihood function does not necessarily satisfy the universal bound. That function, LB(θ; g), is a true likelihood only with respect to the model fg(x; θ) = ∫ f(x; θ, γ)g(γ|θ) dγ and can be guaranteed to satisfy the universal bound only when the observable random variable X is actually generated from fg(x; θ) for some fixed θ. When X is actually distributed according to f(x; θ1, γ1), the probability of observing a sample for which the Bayes factor exceeds k,

Pr(fg(X; θ2)/fg(X; θ1) ≥ k),

can exceed 1/k. In the article I noted the well-known fact that profile likelihoods are not true likelihoods in general, and gave a simple example [one observation from the normal(γ, θ) distribution] showing that if this fact is ignored, then one can be led to grossly inappropriate interpretations of statistical evidence. I then showed that nevertheless, in an important special class of problems it might be acceptable to treat profile likelihoods as if they were true likelihoods, using their ratios as measures of statistical evidence. Specifically, I showed that in a fixed model indexed by a finite-dimensional parameter [X1, X2, . . . , Xn are iid with probability density function f(·; θ, γ)] profile likelihoods behave, in a very important way, just like true likelihoods: in large samples the probability of observing a large ratio in favor of θ2 over θ1, when θ1 is true, is given by the bump function.

Wolpert offers more examples to reinforce the point that outside the class of models where I prove that profile likelihoods must behave well, they can behave badly. But he claims that some of his examples show that they can behave badly even within that class: “Profile likelihood, with or without an asymptotic bound, remains misleading.” I think that this claim rests on misinterpretation of the examples, as I will now explain.

The Need for Theory

Statistics needs a sound theory of evidence, because intuitive judgments, even those of highly capable and experienced statisticians, about what the data say, about what is “misleading” and what is “sensible,” can be quite wrong. This is nicely illustrated in Wolpert’s closing paragraph, where he judges that “In each of these examples the profile likelihood function is misleading as a summary of evidence about θ.”

Let us look at what he apparently considers to be one of the most compelling of the examples (as indicated by his introduction with the words “even worse”). The model is within the class where the large-sample behavior of profile likelihoods is described by the bump function. The example is the profile likelihood for the coefficient of variation in a normal distribution.

Figure 1 shows the profile likelihood for a sample of 10 observations with mean .43 and standard deviation 1.08. What Wolpert judges to be obviously wrong is the behavior at the extremes: the convergence as θ → ±∞ to a (nonzero) limit.

Because this is a problem that is not orthogonalizable, there is no definitive solution to which one can compare the profile likelihood. How then can one determine whether it really is “misleading as a summary of evidence about θ”? Is it possible for a true likelihood function to exhibit the same behavior?

Figure 2 shows the (true) likelihood function for the reciprocal of the mean in a normal distribution model with unit variance, based on the same sample as that in Figure 1. The model is N(μ, 1), so there is no nuisance parameter. The likelihood function exp[−n(x̄ − μ)²/2] represents the evidence about the mean μ, and the likelihood function exp[−n(x̄ − 1/θ)²/2] represents the evidence about θ = 1/μ. This latter function is the one shown in Figure 2. (Because

Figure 1. Profile Likelihood for Normal Coefficient of Variation. The sample (n = 10) has mean .43 and standard deviation 1.08.


Figure 2. Likelihood for Reciprocal of Normal Mean (Variance = 1). The sample is the same as in Figure 1.

the variance equals 1, θ is the coefficient of variation under this model.)

Here is a true likelihood with the same characteristics as the profile likelihood in Figure 1. Is it misleading as a summary of the evidence? Of course not. It shows precisely what the data say under the model. In terms of the mean μ, these observations represent very strong evidence in favor of the best-supported value, x̄ = .43, versus values where |μ| is huge, but only weak evidence versus values of μ near 0. (The likelihood ratio at μ = .43 versus μ = 0 is 2.49, barely stronger than one white ball from the urn.) Equivalently, in terms of θ = 1/μ, the observations represent very strong evidence in favor of the best-supported value, 1/x̄ = 2.34, versus values near 0 (corresponding to huge values of |μ|), but only weak evidence versus values where |θ| is very large (corresponding to μ near 0).
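The likelihood for θ = 1/μ just described is simple enough to evaluate directly from the reported summaries (n = 10, mean .43); rounding of the sample mean accounts for the small gap between the computed ratio and the 2.49 quoted above:

```python
# L(theta) = exp[-n(xbar - 1/theta)^2 / 2] is maximized at theta = 1/xbar and
# converges to the nonzero limit exp[-n*xbar^2/2] as |theta| -> infinity.
import math

n, xbar = 10, 0.43

def L_theta(theta):
    return math.exp(-n * (xbar - 1.0 / theta) ** 2 / 2.0)

best = L_theta(1.0 / xbar)              # maximum, at theta = 1/xbar
tail = L_theta(1e9)                     # far out in the tail
limit = math.exp(-n * xbar**2 / 2.0)    # analytic limit as |theta| -> infinity

print(best, tail, limit, best / limit)  # ratio ~2.5: only weak evidence vs mu = 0
```

The nonzero tail limit is exactly the feature of Figure 1 that Wolpert judged “obviously wrong,” here arising in a true likelihood with no nuisance parameter.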

The 1/k likelihood region, {θ: L(θ) ≥ 1/k}, consists of all of the values of θ that are consistent with the observations in the sense that no other value is better supported by a factor greater than k. The 1/8 and 1/32 regions shown by the horizontal lines in Figure 2 may seem odd at first, but they make perfect sense. In fact, they have the same form as the standard exact confidence region for this problem: the excluded interval comprises all values of θ for which the corresponding value of μ = 1/θ is excluded from the interval x̄ ± √(2 ln k)/√n. The 1/8 likelihood interval for μ excludes values μ < −.22 and 1.07 < μ. The equivalent values −4.60 = 1/(−.22) < θ < 1/(1.07) = .93 are excluded from the likelihood interval for θ.
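The 1/k likelihood interval for μ follows from exp[−n(x̄ − μ)²/2] ≥ 1/k, which gives x̄ ± √(2 ln k / n). A quick check against the endpoints quoted above (small differences reflect rounding of the reported summary statistics):

```python
# 1/8 likelihood interval for mu under the N(mu, 1) model with n = 10,
# xbar = .43, and the corresponding excluded region for theta = 1/mu.
import math

n, xbar, k = 10, 0.43, 8
half_width = math.sqrt(2 * math.log(k) / n)

mu_lo, mu_hi = xbar - half_width, xbar + half_width
print(round(mu_lo, 2), round(mu_hi, 2))        # endpoints near -.22 and 1.07

theta_lo, theta_hi = 1 / mu_lo, 1 / mu_hi      # values of theta = 1/mu excluded
print(round(theta_lo, 2), round(theta_hi, 2))  # near -4.6 and .93
```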

Under the two-parameter normal model, the profile likelihood function for the coefficient of variation σ/μ (Fig. 1) looks the way it does because its shape is determined largely by the evidence about 1/μ (Fig. 2). For this particular sample (the data are actually a random sample from the normal distribution with μ = .1 and σ = 1, so that both θ and σ/μ equal 10), what the data say appears to be accurately represented in Figure 1.

Another Treasure Hunt

I thank Wong for pointing out that recent powerful work aimed primarily at improving conventional statistical methods (giving better approximate confidence intervals and more accurate approximations for their coverage probabilities) is potentially applicable to finding and improving bounds on the probability of misleading evidence, as well as to making large-sample approximations in general more relevant to the evidential analysis of small to moderate samples.

ADDITIONAL REFERENCES

Berger, J., and Bernardo, J. M. (1992), “On the Development of Reference Priors,” in Bayesian Statistics 4, eds. J. M. Bernardo, J. Berger, A. P. Dawid, and A. F. M. Smith, New York: Oxford University Press.

Berger, J. O., and Wolpert, R. L. (1988), The Likelihood Principle (2nd ed.), ed. S. S. Gupta, Hayward, CA: Institute of Mathematical Statistics.

Forster, M., and Sober, E. (forthcoming), “Why Likelihood?,” in The Nature of Scientific Evidence: Empirical, Statistical, and Philosophical Considerations, eds. M. L. Taper and S. R. Lele, Chicago: University of Chicago Press.

Vieland, V. J., and Hodge, S. E. (1998), Review of Statistical Evidence: A Likelihood Paradigm by R. Royall, Annals of Human Genetics, 63, 283-289.
