Statistics Done Wrong: The Woefully Complete Guide

Author	Alex Reinhart
Year	2015
Links	GoodReads
Tags	Mathematics Statistics Science Audit Statistical Literacy

Quotes (13)

Statistical Reforms

In recent years there have been many advocates for statistical reform, and naturally there is disagreement among them on the best method to address these problems. Some insist that p values, which I will show are frequently misleading and confusing, should be abandoned altogether; others advocate a “new statistics” based on confidence intervals. Still others suggest a switch to new Bayesian methods that give more-interpretable results, while others believe statistics as it’s currently taught is just fine but used poorly.

Page 5

Counterintuitiveness of Significance Testing

This is a counterintuitive feature of significance testing: if you want to prove that your drug works, you do so by showing the data is inconsistent with the drug not working.

Page 9

Noise and Real Effects

A statistically insignificant difference could be nothing but noise, or it could represent a real effect that can be pinned down only with more data.

Page 9

Analytically Calculating Power Can Be Difficult or Downright Impossible

Math is another possible explanation for why power calculations are so uncommon: analytically calculating power can be difficult or downright impossible. Techniques for calculating power are not frequently taught in intro statistics courses. And some commercially available statistical software does not come with power calculation functions. It is possible to avoid hairy mathematics by simply simulating thousands of artificial datasets with the effect size you expect and running your statistical tests on the simulated data. The power is simply the fraction of datasets for which you obtain a statistically significant result. But this approach requires programming experience, and simulating realistic data can be tricky.

Page 20
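
Note (not from the book): the simulation approach described in this quote can be sketched in a few lines of Python. This is a minimal sketch under strong assumptions: two normally distributed groups with a known standard deviation of 1, compared with a two-sided z-test. The function name `simulate_power` and all parameter values are illustrative choices, not anything the author specifies.

```python
import random
from statistics import NormalDist, fmean


def simulate_power(effect_size, n_per_group, alpha=0.05, n_sims=2000, seed=0):
    """Estimate the power of a two-sided, two-sample z-test by simulation.

    Assumes both groups are normal with a known standard deviation of 1.
    """
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)  # critical value, about 1.96
    se = (2 / n_per_group) ** 0.5  # standard error of the difference in means
    significant = 0
    for _ in range(n_sims):
        control = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
        treated = [rng.gauss(effect_size, 1.0) for _ in range(n_per_group)]
        z = (fmean(treated) - fmean(control)) / se
        if abs(z) > z_crit:
            significant += 1
    # Power is the fraction of simulated datasets that reach significance.
    return significant / n_sims


for n in (30, 100):
    print(n, simulate_power(effect_size=0.5, n_per_group=n))
```

For this toy setup the analytic power is about 0.49 with 30 subjects per group and about 0.94 with 100, so the simulated fractions should land near those values. Realistic data (unknown variance, skew, dropouts) would need a more careful data generator and the test actually planned for the study.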

Truth Inflation

This effect, known as truth inflation, type M error (M for magnitude), or the winner’s curse, occurs in fields where many researchers conduct similar experiments and compete to publish the most “exciting” results: pharmacological trials, epidemiological studies, gene association studies (“gene A causes condition B”), and psychological studies often show symptoms, along with some of the most-cited papers in the medical literature.

Page 24
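
Note (not from the book): truth inflation is easy to reproduce in a simulation. The sketch below assumes a small true effect (0.2 standard deviations), small studies (20 per group), a known standard deviation of 1, and a two-sided z-test; it then keeps only the studies that reached significance, as a stand-in for what gets published. The function name and all numbers are hypothetical.

```python
import random
from statistics import NormalDist, fmean


def significant_estimates(true_effect=0.2, n_per_group=20, n_studies=5000,
                          alpha=0.05, seed=1):
    """Return the effect estimates from simulated studies that reached significance."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    se = (2 / n_per_group) ** 0.5  # standard error of the difference in means
    kept = []
    for _ in range(n_studies):
        control = fmean([rng.gauss(0.0, 1.0) for _ in range(n_per_group)])
        treated = fmean([rng.gauss(true_effect, 1.0) for _ in range(n_per_group)])
        estimate = treated - control
        if abs(estimate / se) > z_crit:
            kept.append(estimate)
    return kept


published = significant_estimates()
print("share of studies reaching significance:", len(published) / 5000)
print("mean |estimate| among them:", fmean([abs(e) for e in published]))
```

With 20 subjects per group, an estimate must exceed roughly 0.62 in absolute value to be significant, which is more than three times the assumed true effect of 0.2. Every "significant" study in this setup therefore overstates the effect, which is the type M error the quote describes.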

Truth Inflation and Groundbreaking Effect Sizes in Top-Ranked Journals

Consider also that top-ranked journals, such as Nature and Science, prefer to publish studies with groundbreaking results—meaning large effect sizes in novel fields with little prior research. This is a perfect combination for chronic truth inflation. Some evidence suggests a correlation between a journal’s impact factor (a rough measure of its prominence and importance) and the factor by which its studies overestimate effect sizes. Studies that produce less “exciting” results are closer to the truth but less interesting to a major journal editor. brembs2013 siontis2011

Page 25

Journal’s Impact Factor and Overestimation of Effect Sizes

Some evidence suggests a correlation between a journal’s impact factor (a rough measure of its prominence and importance) and the factor by which its studies overestimate effect sizes. Studies that produce less “exciting” results are closer to the truth but less interesting to a major journal editor.

Page 25

Sphygmomanometer

Or perhaps I’m worried that my sphygmomanometers are not perfectly calibrated, so I measure with a different one each day.
I just wanted an excuse to use the word sphygmomanometer.

Page 32

Critique of McClintock's Study of Human Menstrual Cycles

McClintock’s study of human menstrual cycles went something like this:
1. Find groups of women who live together in close contact (for instance, college students in dormitories).
2. Every month or so, ask each woman when her last menstrual period began and to list the other women with whom she spent the most time.
3. Use these lists to split the women into groups that tend to spend time together.
4. For each group of women, see how far the average woman’s period start date deviates from the average.
Small deviations would mean the women’s cycles were aligned, all starting at around the same time. Then the researchers tested whether the deviations decreased over time, which would indicate that the women were synchronizing. To do this, they checked the mean deviation at five different points throughout the study, testing whether the deviation decreased more than could be expected by chance.
Unfortunately, the statistical test they used assumed that if there was no synchronization, the deviations would randomly increase and decrease from one period to another. But imagine two women in the study who start with aligned cycles. One has an average gap of 28 days between periods and the other a gap of roughly 30 days. Their cycles will diverge consistently over the course of the study, starting two days apart, then four days, and so on, with only a bit of random variation because periods are not perfectly timed. Similarly, two women can start the study not aligned but gradually align.
For comparison, if you’ve ever been stuck in traffic, you’ve probably seen how two turn signals blinking at different rates will gradually synchronize and then go out of phase again. If you’re stuck at the intersection long enough, you’ll see this happen multiple times. But to the best of my knowledge, there are no turn signal pheromones.
So we would actually expect two unaligned menstrual cycles to fall into alignment, at least temporarily. The researchers failed to account for this effect in their statistical tests.
They also made an error calculating synchronization at the beginning of the study: if one woman’s period started four days before the study began and another’s started four days after, the difference is only eight days. But periods before the beginning of the study were not counted, so the recorded difference was between the fourth day and the first woman’s next period, as much as three weeks later.
These two errors combined meant that the scientists were able to obtain statistically significant results even when there was no synchronization effect outside what would occur without pheromones.

Page 36
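
Note (not from the book): the drift argument in this quote can be checked with simple arithmetic. The sketch below assumes two perfectly regular cycles of 28 and 30 days with no interaction between the women and no random variation, and compares the k-th onset of each (a simplification that holds while the gap stays under half a cycle). The function name and the 10-day starting offset are hypothetical.

```python
def onset_gaps(period_a=28, period_b=30, start_a=10, start_b=0, n_cycles=8):
    """Days between the k-th period onsets of two women with fixed cycle lengths."""
    return [
        abs((start_a + k * period_a) - (start_b + k * period_b))
        for k in range(n_cycles)
    ]


# Unaligned at the start: the gap shrinks by 2 days per cycle, then grows again.
print(onset_gaps())                       # [10, 8, 6, 4, 2, 0, 2, 4]

# Aligned at the start: the gap grows by 2 days per cycle, as in the quote.
print(onset_gaps(start_a=0, n_cycles=4))  # [0, 2, 4, 6]
```

The first series shows apparent "synchronization" over five cycles from nothing more than a two-day difference in cycle length, which is the effect a valid test would have to account for.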

How to Lie with Smoking Statistics

Attempting to capitalize on Huff’s respected status, the tobacco industry commissioned him to testify before Congress and then to write a book, tentatively titled ‘How to Lie with Smoking Statistics’, covering the many statistical and logical errors alleged to be found in the surgeon general’s report. Huff completed a manuscript, for which he was paid more than $9,000 (roughly $60,000 in 2014 dollars) by tobacco companies and which was positively reviewed by University of Chicago statistician (and paid tobacco industry consultant) K.A. Brownlee. Although it was never published, it’s likely that Huff’s friendly, accessible style would have made a strong impression on the public, providing talking points for watercooler arguments.

Your Eyeball

Your eyeball is not a well-defined statistical procedure.

Page 62

Tentative Exploratory Findings

But aimlessly exploring data means a lot of opportunities for false positives and truth inflation. If in your explorations you find an interesting correlation, the standard procedure is to collect a new dataset and test the hypothesis again. Testing an independent dataset will filter out false positives and leave any legitimate discoveries standing. (Of course, you’ll need to ensure your test dataset is sufficiently powered to replicate your findings.) And so exploratory findings should be considered tentative until confirmed.
If you don’t collect a new dataset or your new dataset is strongly related to the old one, truth inflation will come back to bite you in the butt.

Page 63
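
Note (not from the book): a rough simulation of why an independent dataset filters out false positives. It assumes 1,000 explored hypotheses of which only 50 are real (effect of 0.5 standard deviations), 64 subjects per group, and a two-sided z-test with known standard deviation of 1; instead of generating raw data it draws each test's z-statistic directly. All names and numbers are hypothetical.

```python
import random
from statistics import NormalDist


def explore_then_replicate(n_null=950, n_real=50, effect=0.5, n_per_group=64,
                           alpha=0.05, seed=2):
    """Count real and false findings after exploration and after replication."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    se = (2 / n_per_group) ** 0.5
    true_effects = [0.0] * n_null + [effect] * n_real

    def is_significant(true_effect):
        # Draw the z-statistic of a fresh, independent dataset directly.
        z = rng.gauss(true_effect / se, 1.0)
        return abs(z) > z_crit

    exploratory = [e for e in true_effects if is_significant(e)]
    # Only the exploratory findings are retested, each on new data.
    replicated = [e for e in exploratory if is_significant(e)]

    def summarize(found):
        false_pos = sum(1 for e in found if e == 0.0)
        return {"false": false_pos, "real": len(found) - false_pos}

    return {"exploratory": summarize(exploratory),
            "replicated": summarize(replicated)}


print(explore_then_replicate())
```

Under these assumptions roughly half of the exploratory findings are false positives, while only about 5% of those false positives survive the second test. The replication is adequately powered here (about 0.8); with a smaller replication sample many real effects would be lost as well, which is the caveat in the quote's parenthesis.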

On the Triumph of Mediocrity in Business

A final, famous example dates back to 1933, when the field of mathematical statistics was in its infancy. Horace Secrist, a statistics professor at Northwestern University, published The Triumph of Mediocrity in Business, which argued that unusually successful businesses tend to become less successful and unsuccessful businesses tend to become more successful: proof that businesses trend toward mediocrity. This was not a statistical artifact, he argued, but a result of competitive market forces. Secrist supported his argument with reams of data and numerous charts and graphs and even cited some of Galton’s work in regression to the mean. Evidently, Secrist did not understand Galton’s point.

Page 68
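
Note (not from the book): regression to the mean, the point Secrist missed, shows up in a simulation with no market forces at all. The sketch assumes each firm's yearly performance is a fixed underlying "skill" plus independent luck, both standard normal; the function name and firm count are hypothetical.

```python
import random
from statistics import fmean


def decile_means(n_firms=2000, seed=3):
    """Compare year-1 and year-2 performance of the top and bottom year-1 deciles."""
    rng = random.Random(seed)
    skill = [rng.gauss(0.0, 1.0) for _ in range(n_firms)]  # never changes
    year1 = [s + rng.gauss(0.0, 1.0) for s in skill]       # skill + luck
    year2 = [s + rng.gauss(0.0, 1.0) for s in skill]       # same skill, new luck
    order = sorted(range(n_firms), key=lambda i: year1[i])
    k = n_firms // 10
    bottom, top = order[:k], order[-k:]
    return {
        "top_year1": fmean([year1[i] for i in top]),
        "top_year2": fmean([year2[i] for i in top]),
        "bottom_year1": fmean([year1[i] for i in bottom]),
        "bottom_year2": fmean([year2[i] for i in bottom]),
    }


for name, value in decile_means().items():
    print(name, round(value, 2))
```

The top decile falls back toward the average in year 2 and the bottom decile rises, even though no firm's underlying skill changed. Selecting on an extreme year also selects on extreme luck, and the luck does not repeat.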
