Statistics Done Wrong: The Woefully Complete Guide
Excerpts
In recent years there have been many advocates for statistical reform, and naturally there is disagreement among them on the best method to address these problems. Some insist that p values, which I will show are frequently misleading and confusing, should be abandoned altogether; others advocate a “new statistics” based on confidence intervals. Still others suggest a switch to new Bayesian methods that give more-interpretable results, while others believe statistics as it’s currently taught is just fine but used poorly.
— Page 5
This is a counterintuitive feature of significance testing: if you want to prove that your drug works, you do so by showing the data is inconsistent with the drug not working.
— Page 9
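
As an illustration of this logic (not from the book), here is a minimal Python sketch: a hypothetical drug trial in which we never test "the drug works" directly, but instead ask how surprising the observed data would be if the drug did nothing. The group sizes, effect size, and noise level are all placeholders invented for the example.

```python
# Minimal sketch of the significance-testing logic: we test whether the observed
# data are inconsistent with the null hypothesis that the drug does nothing.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical trial: blood-pressure reduction in placebo vs. treatment groups.
placebo = rng.normal(loc=0.0, scale=10.0, size=50)
treatment = rng.normal(loc=4.0, scale=10.0, size=50)  # assumed true effect of 4 units

# The t-test asks: how surprising would this difference be if there were no effect?
t_stat, p_value = stats.ttest_ind(treatment, placebo)
print(f"p = {p_value:.3f}")  # small p => data are unlikely under "drug does nothing"
```

A small p-value here says only that such a difference would be unlikely if the drug did nothing; it is not the probability that the drug works.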
A statistically insignificant difference could be nothing but noise, or it could represent a real effect that can be pinned down only with more data.
— Page 9
Math is another possible explanation for why power calculations are so uncommon: analytically calculating power can be difficult or downright impossible. Techniques for calculating power are not frequently taught in intro statistics courses. And some commercially available statistical software does not come with power calculation functions. It is possible to avoid hairy mathematics by simply simulating thousands of artificial datasets with the effect size you expect and running your statistical tests on the simulated data. The power is simply the fraction of datasets for which you obtain a statistically significant result. But this approach requires programming experience, and simulating realistic data can be tricky.
— Page 20
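
The simulation approach described above is straightforward to sketch. The Python snippet below assumes a hypothetical two-group comparison with an expected effect of 5 units, a standard deviation of 10, 30 subjects per group, and a 0.05 significance threshold; all of these numbers are placeholders you would replace with your own design.

```python
# Simulation-based power estimate: generate many artificial datasets with the
# expected effect, run the planned test on each, and count how often it is significant.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

def simulated_power(effect=5.0, sd=10.0, n=30, alpha=0.05, trials=10_000):
    significant = 0
    for _ in range(trials):
        control = rng.normal(0.0, sd, n)      # artificial control-group data
        treated = rng.normal(effect, sd, n)   # artificial data with the expected effect
        _, p = stats.ttest_ind(treated, control)
        if p < alpha:
            significant += 1
    # Power is the fraction of simulated datasets that yield a significant result.
    return significant / trials

print(f"Estimated power: {simulated_power():.2f}")
```

If the estimated power comes out low, the usual remedy is a larger sample size; otherwise a null result from the planned study will be largely uninformative.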
This effect, known as truth inflation, type M error (M for magnitude), or the winner’s curse, occurs in fields where many researchers conduct similar experiments and compete to publish the most “exciting” results. Pharmacological trials, epidemiological studies, gene association studies (“gene A causes condition B”), and psychological studies often show symptoms of it, as do some of the most-cited papers in the medical literature.
— Page 24
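
To see why competition for significance inflates effect sizes, consider a rough simulation (my illustration, not the book's): an underpowered hypothetical study design is run many times, and only the statistically significant results get "published". The true effect, sample size, and noise level below are arbitrary.

```python
# Truth inflation / winner's curse: when an underpowered study must reach p < 0.05
# to be published, the effect estimates that survive are systematically too large.
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)

true_effect, sd, n, alpha = 2.0, 10.0, 20, 0.05
published_estimates = []

for _ in range(20_000):
    control = rng.normal(0.0, sd, n)
    treated = rng.normal(true_effect, sd, n)
    _, p = stats.ttest_ind(treated, control)
    if p < alpha:  # only "exciting" (significant) results get published
        published_estimates.append(treated.mean() - control.mean())

print(f"True effect: {true_effect}")
print(f"Mean published estimate: {np.mean(published_estimates):.1f}")  # noticeably inflated
```

In this setup the surviving estimates average several times the true effect, because only unusually large (and lucky) differences clear the significance bar.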
Some evidence suggests a correlation between a journal’s impact factor (a rough measure of its prominence and importance) and the factor by which its studies overestimate effect sizes. Studies that produce less “exciting” results are closer to the truth but less interesting to a major journal editor.
— Page 25
Or perhaps I’m worried that my sphygmomanometers are not perfectly calibrated, so I measure with a different one each day.
- I just wanted an excuse to use the word sphygmomanometer.
— Page 32
Your eyeball is not a well-defined statistical procedure.
— Page 62
But aimlessly exploring data means a lot of opportunities for false positives and truth inflation. If in your explorations you find an interesting correlation, the standard procedure is to collect a new dataset and test the hypothesis again. Testing an independent dataset will filter out false positives and leave any legitimate discoveries standing. (Of course, you’ll need to ensure your test dataset is sufficiently powered to replicate your findings.) And so exploratory findings should be considered tentative until confirmed.
If you don’t collect a new dataset or your new dataset is strongly related to the old one, truth inflation will come back to bite you in the butt.
— Page 63
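
A quick way to see both halves of this advice is to explore pure noise and then try to confirm what you find. In the sketch below (again my illustration, not the book's), 200 hypothetical candidate variables are correlated against an outcome that is unrelated to all of them: exploration turns up a handful of "significant" correlations by chance, and an independent dataset filters them back out.

```python
# Aimless exploration on random data produces false positives; re-testing the
# "interesting" findings on an independent dataset removes most of them.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_subjects, n_variables, alpha = 100, 200, 0.05

def random_study():
    # No variable is truly related to the outcome.
    return rng.normal(size=(n_subjects, n_variables)), rng.normal(size=n_subjects)

explore_X, explore_y = random_study()
hits = [j for j in range(n_variables)
        if stats.pearsonr(explore_X[:, j], explore_y)[1] < alpha]
print(f"'Interesting' correlations found while exploring: {len(hits)}")  # about 5% of 200, by chance

confirm_X, confirm_y = random_study()  # an independent dataset
confirmed = [j for j in hits
             if stats.pearsonr(confirm_X[:, j], confirm_y)[1] < alpha]
print(f"Still significant in the confirmation dataset: {len(confirmed)}")  # usually zero or one
```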
A final, famous example dates back to 1933, when the field of mathematical statistics was in its infancy. Horace Secrist, a statistics professor at Northwestern University, published The Triumph of Mediocrity in Business, which argued that unusually successful businesses tend to become less successful and unsuccessful businesses tend to become more successful: proof that businesses trend toward mediocrity. This was not a statistical artifact, he argued, but a result of competitive market forces. Secrist supported his argument with reams of data and numerous charts and graphs, and he even cited some of Galton’s work on regression to the mean. Evidently, Secrist did not understand Galton’s point.
— Page 68
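
The artifact Secrist missed, regression to the mean, is easy to reproduce with simulated firms whose yearly performance is part stable "skill" and part luck (a hypothetical setup, not Secrist's data):

```python
# Regression to the mean: extreme performers in year 1 look more ordinary in year 2,
# purely because the noise component of their year-1 result does not repeat.
import numpy as np

rng = np.random.default_rng(3)
n_firms = 10_000
skill = rng.normal(size=n_firms)              # persistent component
year1 = skill + rng.normal(size=n_firms)      # observed performance, year 1
year2 = skill + rng.normal(size=n_firms)      # observed performance, year 2

top = year1 > np.quantile(year1, 0.9)         # the "unusually successful" firms
bottom = year1 < np.quantile(year1, 0.1)      # the "unsuccessful" firms

print(f"Top decile:    year 1 mean {year1[top].mean():.2f}, year 2 mean {year2[top].mean():.2f}")
print(f"Bottom decile: year 1 mean {year1[bottom].mean():.2f}, year 2 mean {year2[bottom].mean():.2f}")
```

Both extreme groups drift toward the middle in year 2 even though nothing about the firms changed, which is the pattern Secrist read as a triumph of mediocrity.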