From the article:
There are two major types of data torturing. In the first, which I term "opportunistic" data torturing, the perpetrator simply pores over the data until a "significant" association is found between variables and then devises a biologically plausible hypothesis to fit the association. The second, or "Procrustean," type of data torturing is performed by deciding on the hypothesis to be proved and making the data fit the hypothesis (Procrustes, a robber in Greek mythology, made all his victims fit the length of his bed by stretching or cutting off their legs).
Clues to Data Torturing:
Data torturing can rarely be proved. There are, however, clues that should arouse the reader’s suspicion.
In the case of opportunistic data torturing (the search for chance associations), the reader must ask, Is this a chance finding with an a posteriori hypothesis concocted to give it credibility, or is this an honest hypothesis-generating study? Tukey points out the need for exploratory studies using “theoretical insights and exploration of past data.” Hypothesis-generating studies (sometimes referred to somewhat contemptuously as “fishing expeditions”) should be identified as such. To warrant further exploration, findings from such studies should be biologically plausible. If the fishing expedition catches a boot, the fishermen should throw it back, not claim that they were fishing for boots. If a finding has good data from animal studies or related human studies to support it, it is unlikely to have resulted from opportunistic data torturing. If it has neither biologic plausibility nor supporting data, it should be viewed with a jaundiced eye.
Similarly, an honest exploratory study should indicate how many comparisons were made. Although there is disagreement about how (or even whether) to adjust for multiple comparisons, most experts agree that large numbers of comparisons will produce apparently statistically significant findings that are actually due to chance. The data torturer will act as if every positive result confirmed a major hypothesis. The honest investigator will limit the study to focused questions, all of which make biologic sense. The cautious reader should look at the number of “significant” results in the context of how many comparisons were made. In the occupational-exposure study described earlier, nine “significant” findings were reported. Given that 158 comparisons were made, eight of those nine results could easily have occurred by chance.
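The arithmetic behind this caution is straightforward. At the conventional 0.05 significance threshold, each comparison of pure noise has a 5 percent chance of appearing “significant,” so 158 null comparisons are expected to yield about 7.9 spurious findings. A minimal sketch (the simulation and its parameters are illustrative, not taken from the study itself):

```python
import random

# At a 0.05 threshold, each null comparison has a 5% chance of
# appearing "significant" purely by chance.
ALPHA = 0.05
N_COMPARISONS = 158  # as in the occupational-exposure study cited above

# Expected number of chance "significant" findings:
expected = ALPHA * N_COMPARISONS
print(f"Expected false positives: {expected:.1f}")  # 7.9

# Quick check by simulation: p-values from null comparisons are
# (ideally) uniform on [0, 1); count how many fall below the
# threshold, averaged over many simulated studies.
random.seed(1)
trials = 10_000
hits = sum(
    sum(random.random() < ALPHA for _ in range(N_COMPARISONS))
    for _ in range(trials)
)
print(f"Simulated average: {hits / trials:.1f}")  # close to 7.9
```

This is why nine “significant” results out of 158 comparisons, taken alone, are barely distinguishable from chance.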
Identifying Procrustean data torturing (in which the data are made to fit the hypothesis) also requires asking the right questions:
Why were study subjects dropped? One recent study of the health effects of exposure to heat dropped one of the four categories of exposure, changing a nonsignificant effect to a significant effect. One should suspect data torturing whenever subjects are dropped without a clear reason, or when a large proportion of subjects are excluded for any reason.
Does the classification of exposure and disease make sense? Statements such as “we studied those with at least five years of exposure to lead smelters, or those with blood lead levels of 50 micrograms per deciliter or higher” should raise questions. Why were the data on subjects with shorter exposure or less elevated lead levels not reported? Is it because they did not fit the hypothesis?
Are cutoff points for laboratory studies reasonable and customary? Some of the bolder data torturers will argue that the clustering of subjects’ test values at the upper range of normal is evidence of a pathologic state. Others will take advantage of the lack of a well-established cutoff point to select the point that makes their data produce the most significant results. A study of AIDS could use various CD4 cell counts as cutoff points, then report the one that shows the most impressive effect. The presence of a dose-response relation is evidence that the reported effect is genuine, not the result of arbitrary classification. If a diabetic woman’s risk of miscarriage increases 5 percent for each 1 percent increase in her glycosylated hemoglobin level, the association is not likely to be due to data torturing. The key is that the effect is consistent across a wide range of values.
Is the rationale for the subgroup analyses convincing? If a drug works only in women over 60 years of age, the savvy reader should suspect a chance finding. Remember that two sexes, multiple age groups, and different clinical features such as stages of disease make it possible for the investigators to examine the data in many different ways.
Is there a clear biologic mechanism that could account for the effect in one subgroup but not in others, or were multiple comparisons made in order to produce positive results? “The study drug produced significantly increased survival at 18 months” may mean that there were no significant differences in survival at any of the other five periods examined.
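The six-period example can be put in numbers. If each of six survival comparisons were an independent null test at p < 0.05, the chance that at least one comes up “significant” is roughly one in four (independence is an assumption made here for illustration; survival tests at successive time points are in fact correlated, which changes the exact figure but not the lesson):

```python
# Probability that at least one of six independent null comparisons
# crosses p < 0.05 by chance alone. Independence is assumed for
# illustration; correlated time points alter the exact value.
ALPHA = 0.05
periods = 6
p_at_least_one = 1 - (1 - ALPHA) ** periods
print(f"P(at least one spurious 'significant' period): {p_at_least_one:.3f}")  # 0.265
```

A reported effect at one of six time points should therefore be weighed against this baseline rate of chance findings.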
In the same vein, it is important to ask whether the data have been censored. As I noted above, looking only at the group that survived at least three months after starting treatment may disguise the fact that the drug under study caused a substantial number of deaths in the first three months.