Imagine you work with some data and assume that the underlying distribution is approximately normal. In such cases, the data analysis typically involves non-robust statistics like the mean and the standard deviation. While these metrics are highly efficient under normality, they make the analysis procedure fragile: a single extreme value can corrupt all the results. You may not expect any significant outliers, but you can never be 100% sure. To avoid surprises and ensure the reliability of the results, it may be tempting to automatically exclude all outliers from the collected samples. While this approach is widely adopted, it conceals an essential part of the obtained data and can lead to fallacious conclusions.
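To make the fragility concrete, here is a minimal sketch in Python. The measurements are invented purely for illustration, and the `summarize` helper is a hypothetical name, not part of any established procedure. It compares the mean, the standard deviation, and the median before and after a single extreme value is added to the sample.

```python
import statistics


def summarize(sample):
    """Return (mean, standard deviation, median) of a sample."""
    return (
        statistics.mean(sample),
        statistics.stdev(sample),
        statistics.median(sample),
    )


# Hypothetical measurements, invented purely for illustration.
clean = [9.8, 10.1, 10.0, 9.9, 10.2, 10.0, 9.7, 10.3]
# The same sample with one extreme value appended.
contaminated = clean + [1000.0]

for name, sample in [("clean", clean), ("contaminated", contaminated)]:
    mean, sd, median = summarize(sample)
    print(f"{name:>12}: mean={mean:.2f}  sd={sd:.2f}  median={median:.2f}")
```

With these numbers, one extra value moves the mean from about 10 to about 120 and inflates the standard deviation by several orders of magnitude, while the median stays at 10. That sensitivity is exactly what makes automatic removal tempting.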
Let me recount a classic story about ozone holes, which is often used to illustrate the danger of blind outlier removal:
The discovery of the ozone hole was announced in 1985 by a British team working on the ground with “conventional” instruments and examining its observations in detail. Only later, after reexamining the data transmitted by the TOMS instrument on NASA’s Nimbus 7 satellite, was it found that the hole had been forming for several years. Why had nobody noticed it? The reason was simple: the systems processing the TOMS data, designed in accordance with predictions derived from models, which in turn were established on the basis of what was thought to be “reasonable”, had rejected the very (“excessively”) low values observed above the Antarctic during the Southern spring. As far as the program was concerned, there must have been an operating defect in the instrument.
— R. Kandel, Our Changing Climate (1991)
According to the cited fragment, the research team had enough data to detect the ozone holes, but the software automatically discarded this information because it recognized unusual values as outliers that should be removed.
However, there is another version of this story that claims that there was no automatic outlier removal of the TOMS data. According to the letter from Dr. Richard McPeters (Head of the Ozone Processing Team at NASA) to Dr. Pukelsheim ([Pukelsheim1990]):
This myth was the result of a statement made by one of my colleagues in reply to a question during an interview on the science program NOVA in which he was asked why NASA did not discover the ozone hole first. He was not directly involved in ozone processing at that time and his answer was not correct.
In this letter, Dr. Richard McPeters explains that NASA engineers were aware of the anomalous data collected by NASA’s Nimbus-7 satellite (the software did not throw the outliers away; it reported them properly). To investigate the atypical values and verify their correctness, they compared them with the data from the South Pole Dobson ground station. Unfortunately, the Dobson values for the relevant period were “erroneous and uncorrectable,” so it was impossible to determine whether the satellite data were anomalous due to an instrument error or due to an actual physical phenomenon.
A similar story can be found in [Bhartia2009]:
Unfortunately, some in the scientific community have derived incorrect conclusions from this experience. It has been widely but incorrectly reported that the TOMS team missed the discovery of the ozone hole since “they rejected the data and discovered it only after the publication of Farman et al. 1985 paper” (see Seinfeld & Pandis 1998, page 189, for a typical example). In fact when the South Pole Dobson data from October 1984 became available to the TOMS team in late 1984, well before the publication of Farman et al. paper, there was no doubt on the validity of satellite total ozone retrievals. (Ironically, October 1983 total ozone values initially reported by the S. Pole station showed normal ozone values that were much larger than that being reported by TOMS. These data were later withdrawn by the station.) The correct conclusion that should be derived from this experience is that remote sensing techniques, such as those used by nadir-viewing satellites, depend critically on information generated by in-situ techniques because of their dependence on prior information. When the retrievals are pushed beyond the limits set by prior information the data become suspect. To report such data without proper validation is scientifically irresponsible. The TOMS team followed the correct approach even though it caused several months delay in reporting the data to the scientific community.
While the original story is apparently a myth, there is a reason why it became so popular: the automatic removal of outliers is regrettably widespread in data analysis. This is understandable: it is much easier to discard inconvenient data than to examine it properly. However, such a shortcut can lead to unfortunate consequences and wrong conclusions. A proper data analysis procedure should not only provide a clear definition of outliers but also explain their origin.
If outliers appear in the data, it is a good idea to investigate them manually rather than remove them automatically just to keep them from interfering with non-robust analysis procedures. Of course, it can be impractical to involve humans every time we detect extreme values (especially if we have to process a lot of data). Once the source of outliers is determined, we can try to automate the outlier examination process by implementing logic that recognizes special types of outliers. However, if we fail to match the obtained extreme values to one of the known patterns, such a situation should be flagged for manual investigation. Using such an approach, we can iteratively extend our knowledge base and support handling different types of exceptional situations. Remember that outliers are not inconvenient values; they are an essential part of the data, which may provide valuable insights.
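The workflow described above could be sketched roughly as follows. Everything here is an illustrative assumption rather than a recommended implementation: the `KNOWN_PATTERNS` knowledge base, the `find_extreme_values` and `triage` helpers, the detection rule based on the median absolute deviation, its threshold, and the sample readings are all hypothetical.

```python
import statistics
from typing import Callable

# Hypothetical knowledge base: each entry maps a label to a predicate that
# recognizes one kind of extreme value whose origin is already understood.
KNOWN_PATTERNS: dict[str, Callable[[float], bool]] = {
    "sentinel value -999": lambda x: x == -999.0,
    "negative reading": lambda x: x < 0,
}


def find_extreme_values(sample, threshold=3.0):
    """Return values whose distance from the median exceeds threshold * MAD.

    The rule and the threshold are illustrative assumptions only.
    """
    median = statistics.median(sample)
    mad = statistics.median([abs(x - median) for x in sample])
    return [x for x in sample if abs(x - median) > threshold * mad]


def triage(sample):
    """Split extreme values into explained ones and ones needing manual review."""
    explained, needs_review = [], []
    for x in find_extreme_values(sample):
        # Use the first known pattern that matches, if any.
        label = next(
            (name for name, matches in KNOWN_PATTERNS.items() if matches(x)),
            None,
        )
        if label is None:
            needs_review.append(x)
        else:
            explained.append((x, label))
    return explained, needs_review


# Hypothetical readings, invented for illustration.
readings = [10.1, 9.9, 10.0, -999.0, 10.2, 9.8, 57.3, 10.0]
explained, needs_review = triage(readings)
print("explained:", explained)
print("flagged for manual investigation:", needs_review)
```

Note that nothing is deleted from `readings`: the sketch only labels the extreme values. Once a flagged value has been investigated and its origin understood, a new entry can be added to `KNOWN_PATTERNS`, which is how the knowledge base grows over time.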
The author thanks Paul Velleman for valuable discussions.
Pukelsheim, Friedrich. “Robustness of statistical gossip and the Antarctic ozone hole.” (1990)
Bhartia, Pawan K. “Role of satellite measurements in the discovery of stratospheric ozone depletion.” In Twenty Years of Ozone Decline: Proceedings of the Symposium for the 20th Anniversary of the Montreal Protocol, pp. 183-189. Springer Netherlands (2009)
- Gavin Schmidt. “What did NASA know? and when did they know it?” (2017)
- Rob J Hyndman. “Omitting outliers” (2016)
- Farhana Ismail. “Investigating outliers” (2021)