Consequences of Dichotomization

Authors	Valerii Fedorov Frank Mannino Rongmei Zhang
Year	2008
DOI	10.1002/pst.331
Tags	Mathematics Statistics Science Audit Dichotomization

Reference

Valerii Fedorov, Frank Mannino, Rongmei Zhang “Consequences of dichotomization” (2008) // Pharmaceutical Statistics. Publisher: Wiley. Vol. 8. No 1. Pp. 50–61. DOI: 10.1002/pst.331

Bib

@Article{fedorov2008,
  title = {Consequences of dichotomization},
  volume = {8},
  issn = {1539-1612},
  url = {http://dx.doi.org/10.1002/pst.331},
  doi = {10.1002/pst.331},
  number = {1},
  journal = {Pharmaceutical Statistics},
  publisher = {Wiley},
  author = {Fedorov, Valerii and Mannino, Frank and Zhang, Rongmei},
  year = {2008},
  month = {apr},
  pages = {50--61}
}

Quotes (2)

Efficiency Loss of Dichotomization

Dichotomization is the transformation of a continuous outcome (response) to a binary outcome. This approach, while somewhat common, is harmful from the viewpoint of statistical estimation and hypothesis testing. We show that this leads to loss of information, which can be large. For normally distributed data, this loss in terms of Fisher’s information is at least $1-2/\pi$ (or 36%). In other words, 100 continuous observations are statistically equivalent to 158 dichotomized observations. The amount of information lost depends greatly on the prior choice of cut points, with the optimal cut point depending upon the unknown parameters. The loss of information leads to loss of power or conversely a sample size increase to maintain power. Only in certain cases, for instance, in estimating a value of the cumulative distribution function and when the assumed model is very different from the true model, can the use of dichotomized outcomes be considered a reasonable approach.

Page 50
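
The $1-2/\pi$ figure in the quote above can be reproduced with a short calculation. This is a sketch under assumptions that are not stated in the quoted passage: a single observation $X \sim N(\mu, \sigma^2)$ with known $\sigma$, a fixed cut point $c$, the binary outcome $Y = \mathbf{1}\{X > c\}$, and interest in estimating $\mu$. The symbols $c$, $z$, $p$, $I_X$ and $I_Y$ are introduced here for illustration only.

$$
z = \frac{c-\mu}{\sigma}, \qquad p = P(Y=1) = 1-\Phi(z), \qquad \frac{\partial p}{\partial \mu} = \frac{\varphi(z)}{\sigma}
$$

$$
I_X(\mu) = \frac{1}{\sigma^2}, \qquad
I_Y(\mu) = \frac{(\partial p/\partial \mu)^2}{p\,(1-p)} = \frac{\varphi(z)^2}{\sigma^2\,\Phi(z)\,[1-\Phi(z)]}
$$

$$
\frac{I_Y(\mu)}{I_X(\mu)} = \frac{\varphi(z)^2}{\Phi(z)\,[1-\Phi(z)]}, \qquad
\left.\frac{I_Y(\mu)}{I_X(\mu)}\right|_{z=0} = \frac{1/(2\pi)}{1/4} = \frac{2}{\pi} \approx 0.637
$$

Under these assumptions the ratio is largest when the cut point sits at the mean ($z = 0$), so the relative loss $1 - I_Y/I_X$ is at least $1-2/\pi \approx 0.363$, and the reciprocal $\pi/2 \approx 1.57$ is the factor by which the number of dichotomized observations must exceed the number of continuous ones to carry the same information.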

Dichotomization Should Be Avoided in Most Cases

The knowledge of losing information from dichotomizing a continuous outcome is nothing new. However, many previous writings report on the optimal choice of cut points, which depends upon the parameters we wish to estimate. If we are lucky, the chosen cut point is near the optimal point, but the consequences of dichotomizing become more dire as we deviate from the optimal point. We focus our study on the evaluation of losses caused by dichotomization given cut points. While the analysis of dichotomized outcomes may be easier, there are no benefits to this approach when the true outcomes can be observed and the ‘working’ model is flexible enough to describe the population at hand. Thus, dichotomization should be avoided in most cases. Only when we wish to estimate a CDF value, our working model poorly approximates reality, and our sample size is large will the biasedness of model-based estimators overpower the improvement in variance. In this case, the dichotomized estimator may lead to better results, but further study-specific consideration is needed. We also want to emphasize that while analysis should be done using actual outcomes, some aspects of this analysis can be reported on a dichotomized scale.

Page 59
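
The remark that the consequences "become more dire as we deviate from the optimal point" can be illustrated numerically. The following is a minimal sketch, not the authors' code. It assumes normally distributed data with known standard deviation, a cut point expressed as a standardized distance `z` from the mean, and the relative efficiency $\varphi(z)^2 / \{\Phi(z)[1-\Phi(z)]\}$ of the dichotomized outcome for estimating the mean. The function names are hypothetical.

```python
import math


def normal_pdf(z: float) -> float:
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def normal_cdf(z: float) -> float:
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def relative_efficiency(z: float) -> float:
    """Fisher information of the dichotomized outcome divided by that of
    the continuous outcome, for a cut point z standard deviations from
    the mean (assumption: normal data, known variance, mean estimated)."""
    p = normal_cdf(z)
    return normal_pdf(z) ** 2 / (p * (1.0 - p))


def sample_size_multiplier(z: float) -> float:
    """Factor by which the sample size must grow to recover the
    information lost by dichotomizing at standardized cut point z."""
    return 1.0 / relative_efficiency(z)


if __name__ == "__main__":
    print(" z   efficiency  loss   n multiplier")
    for z in (0.0, 0.5, 1.0, 1.5, 2.0):
        eff = relative_efficiency(z)
        mult = sample_size_multiplier(z)
        print(f"{z:3.1f}  {eff:10.3f}  {1 - eff:5.3f}  {mult:12.2f}")
```

At `z = 0` the efficiency equals $2/\pi$, and it falls as the cut point moves into either tail, so the required sample size multiplier grows accordingly.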