P values, hypothesis tests, and likelihood: implications for epidemiology of a neglected historical debate


Steven N Goodman “P values, hypothesis tests, and likelihood: implications for epidemiology of a neglected historical debate” (1993) // American Journal of Epidemiology. Publisher: Oxford University Press. Vol. 137. No 5. Pp. 485–496. DOI: 10.1093/oxfordjournals.aje.a116700


  title = {P values, hypothesis tests, and likelihood: implications for epidemiology of a neglected historical debate},
  author = {Goodman, Steven N},
  journal = {American Journal of Epidemiology},
  volume = {137},
  number = {5},
  pages = {485--496},
  year = {1993},
  publisher = {Oxford University Press},
  doi = {10.1093/oxfordjournals.aje.a116700}

Quotes (2)

Scientific Method

The originators of the statistical frameworks that underlie modern epidemiologic studies recognized that their methods could not be interpreted properly without an understanding of their philosophical underpinnings. Neyman held that inductive reasoning was an illusion and that the only meaningful parameters of importance in an experiment were constraints on the number of statistical “errors” we would make, defined before an experiment. Fisher rejected mechanistic approaches to inference, believing in a more flexible, inductive approach to science. One of Fisher’s developments, mathematical likelihood, fit into such an approach. The p value, which Fisher wanted used in a similar manner, invited misinterpretation because it occupied a peculiar middle ground. Because of its resemblance to the pretrial α error, it was absorbed into the hypothesis test framework. This created two illusions: that an “error rate” could be measured after an experiment and that this posttrial “error rate” could be regarded as a measure of inductive evidence. Even though Fisher, Neyman, and many others have recognized these as fallacies, their perpetuation has been encouraged by the manner in which we use the p value today. One consequence is that we overestimate the evidence for associations, particularly with p values in the range of 0.001–0.05, creating misleading impressions of their plausibility. Another result is that we minimize the importance of judgment in inference, because its role is unclear when postexperiment evidential strength is thought to be measurable with preexperiment “error-rates.” Many experienced epidemiologists have tried to correct these problems by offering guidelines about how p values should be used. We may be more effective if, in the spirit of Fisher and Neyman, we instead focus on clarifying what p values mean, and on what we mean by the “scientific method.”

Neyman-Pearson vs. Fisher

It is not generally appreciated that the p value, as conceived by R. A. Fisher, is not compatible with the Neyman-Pearson hypothesis test in which it has become embedded. The p value was meant to be a flexible inferential measure, whereas the hypothesis test was a rule for behavior, not inference. The combination of the two methods has led to a reinterpretation of the p value simultaneously as an “observed error rate” and as a measure of evidence. Both of these interpretations are problematic, and their combination has obscured the important differences between Neyman and Fisher on the nature of the scientific method and inhibited our understanding of the philosophic implications of the basic methods in use today. An analysis using another method promoted by Fisher, mathematical likelihood, shows that the p value substantially overstates the evidence against the null hypothesis. Likelihood makes clearer the distinction between error rates and inferential evidence and is a quantitative tool for expressing evidential strength that is more appropriate for the purposes of epidemiology than the p value.
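
The claim that the p value overstates the evidence against the null hypothesis can be illustrated with a small calculation. The sketch below is not taken from the excerpt. It assumes a normally distributed test statistic and a two-sided p value, and uses the standard bound exp(-z²/2) on the likelihood ratio of the null hypothesis against the best-supported alternative. The function name `min_likelihood_ratio` is hypothetical, chosen only for this example.

```python
import math
from statistics import NormalDist


def min_likelihood_ratio(p_two_sided: float) -> float:
    """Smallest possible likelihood ratio L(null) / L(alternative).

    Assumes a normal test statistic. The alternative is taken to be the
    value best supported by the data, so this is the strongest evidence
    against the null that any single alternative could provide.
    """
    # Convert the two-sided p value back to the z score that produced it.
    z = NormalDist().inv_cdf(1 - p_two_sided / 2)
    # Ratio of normal densities at the null versus at the observed value.
    return math.exp(-z * z / 2)


for p in (0.05, 0.01, 0.001):
    lr = min_likelihood_ratio(p)
    print(f"p = {p:<6} minimum likelihood ratio = {lr:.4f} (about 1 in {1 / lr:.0f})")
```

Under these assumptions, a p value of 0.05 corresponds to a minimum likelihood ratio of roughly 0.15, so even the best-supported alternative is favored over the null by only about 7 to 1, which is much weaker than "1 in 20" suggests.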