# Abandonment - Essays

The aim of science is to establish facts, as accurately as possible. It is therefore crucially important to determine whether an observed phenomenon is real, or whether it’s the result of pure chance. If you declare that you’ve discovered something when in fact it’s just random, that’s called a false discovery or a false positive. And false positives are alarmingly common in some areas of medical science.

In 2005, the epidemiologist John Ioannidis at Stanford caused a storm when he wrote the paper ‘Why Most Published Research Findings Are False’, focusing on results in certain areas of biomedicine. He’s been vindicated by subsequent investigations. For example, a recent article found that repeating 100 different results in experimental psychology confirmed the original conclusions in only 38 per cent of cases. It’s probably at least as bad for brain-imaging studies and cognitive neuroscience. How can this happen?

The problem of how to distinguish a genuine observation from random chance is a very old one. It’s been debated for centuries by philosophers and, more fruitfully, by statisticians. It turns on the distinction between induction and deduction. Science is an exercise in inductive reasoning: we are making observations and trying to infer general rules from them. Induction can never be certain. In contrast, deductive reasoning is easier: you deduce what you would expect to observe if some general rule were true and then compare it with what you actually see. The problem is that, for a scientist, deductive arguments don’t directly answer the question that you want to ask.

What matters to a scientific observer is how often you’ll be wrong if you claim that an effect is real, rather than being merely random. That’s a question of induction, so it’s hard. In the early 20th century, it became the custom to avoid induction, by changing the question into one that used only deductive reasoning. In the 1920s, the statistician Ronald Fisher did this by advocating tests of statistical significance. These are wholly deductive and so sidestep the philosophical problems of induction.

Tests of statistical significance proceed by calculating the probability of making our observations (or the more extreme ones) if there were no real effect. This isn’t an assertion that there *is* no real effect, but rather a calculation of what *would be expected if* there were no real effect. The postulate that there is no real effect is called the *null hypothesis*, and the probability is called the *p-*value. Clearly the smaller the *p-*value, the less plausible the null hypothesis, so the more likely it is that there is, in fact, a real effect. All you have to do is to decide how small the *p-*value must be before you declare that you’ve made a discovery. But that turns out to be very difficult.

The problem is that the *p-*value gives the right answer to the wrong question. What we really want to know is *not* the probability of the observations given a hypothesis about the existence of a real effect, but rather the probability that there *is* a real effect – that the hypothesis is true – given the observations. And that is a problem of induction.

Confusion between these two quite different probabilities lies at the heart of why *p-*values are so often misinterpreted. It’s called the error of the transposed conditional. Even quite respectable sources will tell you that the *p-*value is the probability that your observations occurred by chance. And that is plain wrong.

Suppose, for example, that you give a pill to each of 10 people. You measure some response (such as their blood pressure). Each person will give a different response. And you give a different pill to 10 other people, and again get 10 different responses. How do you tell whether the two pills are really different?

The conventional procedure would be to follow Fisher and calculate the probability of making the observations (or the more extreme ones) if there were no true difference between the two pills. That’s the *p-*value, based on deductive reasoning. *P-*values of less than 5 per cent have come to be called ‘statistically significant’, a term that’s ubiquitous in the biomedical literature, and is now used to suggest that an effect is real, not just chance.
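To make "the probability of making the observations (or the more extreme ones) if there were no true difference" concrete, here is a minimal sketch of one way to compute such a probability: an exact permutation test. This is a technique swapped in for illustration, not necessarily the test a given study would use. The function name and the blood-pressure numbers are hypothetical, invented only to make the example run.

```python
from itertools import combinations


def permutation_p_value(group_a, group_b):
    """Two-sided exact permutation p-value for a difference in means.

    If the pills do not differ, the labels 'A' and 'B' are arbitrary, so
    every way of splitting the pooled responses into two groups is equally
    likely. The p-value is the fraction of splits whose difference in means
    is at least as large as the one actually observed.
    """
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    n_b = len(pooled) - n_a
    total = sum(pooled)
    observed = abs(sum(group_a) / n_a - sum(group_b) / n_b)

    as_extreme = 0
    n_splits = 0
    for idx in combinations(range(len(pooled)), n_a):
        sum_a = sum(pooled[i] for i in idx)
        diff = abs(sum_a / n_a - (total - sum_a) / n_b)
        # Small tolerance so ties with the observed value are counted.
        if diff >= observed - 1e-12:
            as_extreme += 1
        n_splits += 1
    return as_extreme / n_splits


# Hypothetical blood-pressure changes (mmHg), invented for illustration only.
pill_a = [-12, -8, -15, -3, -10, -6, -9, -14, -5, -11]
pill_b = [-4, -7, 1, -9, -2, -6, 0, -8, -3, -5]

p_value = permutation_p_value(pill_a, pill_b)
print(f"p = {p_value:.4f}")
```

Whatever number this prints, it answers only the deductive question: how often would a difference this large turn up if the pills were identical? It says nothing directly about how likely it is that the pills really differ.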

But the dichotomy between ‘significant’ and ‘not significant’ is absurd. There’s obviously very little difference between the implication of a *p-*value of 4.7 per cent and of 5.3 per cent, yet the former has come to be regarded as success and the latter as failure. And ‘success’ will get your work published, even in the most prestigious journals. That’s bad enough, but the real killer is that, if you observe a ‘just significant’ result, say *P* = 0.047 (4.7 per cent) in a single test, and claim to have made a discovery, the chance that you are wrong is at least 26 per cent, and could easily be more than 80 per cent. How can this be so?

For one, it’s of little use to say that your observations would be rare if there were no real difference between the pills (which is what the *p-*value tells you), unless you can say whether or not the observations would *also* be rare when there *is* a true difference between the pills. Which brings us back to induction.

The problem of induction was solved, in principle, by the Reverend Thomas Bayes in the middle of the 18th century. He showed how to convert the probability of the observations given a hypothesis (the deductive problem) to what we actually want, the probability that the hypothesis is true given some observations (the inductive problem). But how to use his famous theorem in practice has been the subject of heated debate ever since.

Take the proposition that the Earth goes round the Sun. It either does or it doesn’t, so it’s hard to see how we could pick a probability for this statement. Furthermore, the Bayesian conversion involves assigning a value to the probability that your hypothesis is right *before* any observations have been made (the ‘prior probability’). Bayes’s theorem allows that prior probability to be converted to what we want, the probability that the hypothesis is true given some relevant observations, which is known as the ‘posterior probability’.
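In symbols, the conversion can be sketched as follows. The notation is an assumption made here for illustration and does not come from the essay: $H$ stands for the hypothesis that there is a real effect, $\neg H$ for the null hypothesis, and $D$ for the observations.

$$
P(H \mid D) = \frac{P(D \mid H)\,P(H)}{P(D \mid H)\,P(H) + P(D \mid \neg H)\,P(\neg H)}
$$

Here $P(H)$ is the prior probability and $P(H \mid D)$ is the posterior probability. A *p-*value is a statement about the observations under the null hypothesis, the kind of quantity that sits on the right-hand side; it is not the left-hand side, which is what we actually want.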

These intangible probabilities persuaded Fisher that Bayes’s approach wasn’t feasible. Instead, he proposed the wholly deductive process of null hypothesis significance testing. The realisation that this method, as it is commonly used, gives alarmingly large numbers of false positive results has spurred several recent attempts to bridge the gap.

There is one uncontroversial application of Bayes’s theorem: diagnostic screening, the tests that doctors give healthy people to detect warning signs of disease. They’re a good way to understand the perils of the deductive approach.

In theory, picking up on the early signs of illness is obviously good. But in practice there are usually so many false positive diagnoses that it just doesn’t work very well. Take dementia. Roughly 1 per cent of the population suffer from mild cognitive impairment, which might, but doesn’t always, lead to dementia. Suppose that the test is quite a good one, in the sense that 95 per cent of the time it gives the right (negative) answer for people who are free of the condition. That means that 5 per cent of the people who don’t have cognitive impairment will test, falsely, as positive. That doesn’t sound bad. It’s directly analogous to tests of significance which will give 5 per cent of false positives when there is no real effect, if we use a *p-*value of less than 5 per cent to mean ‘statistically significant’.

But in fact the screening test is not good – it’s actually appallingly bad, because 86 per cent, not 5 per cent, of all positive tests are false positives. So only 14 per cent of positive tests are correct. This happens because most people don’t have the condition, and so the false positives from these people (5 per cent of 99 per cent of the people), outweigh the number of true positives that arise from the much smaller number of people who have the condition (80 per cent of 1 per cent of the people, if we assume 80 per cent of people with the disease are detected successfully). There’s a YouTube video of my attempt to explain this principle, or you can read my recent paper on the subject.
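The arithmetic behind the 86 per cent figure is short enough to check directly. The sketch below uses only the three numbers quoted in the text (1 per cent prevalence, 95 per cent of unaffected people correctly testing negative, 80 per cent of affected people detected); the function and parameter names are illustrative choices, not taken from the essay.

```python
def false_positive_fraction(prevalence, sensitivity, specificity):
    """Fraction of all positive test results that are false positives."""
    # People who have the condition and are correctly flagged.
    true_positives = prevalence * sensitivity
    # People who do not have the condition but are flagged anyway.
    false_positives = (1 - prevalence) * (1 - specificity)
    return false_positives / (true_positives + false_positives)


frac = false_positive_fraction(prevalence=0.01, sensitivity=0.80, specificity=0.95)
print(f"{frac:.0%} of positive tests are false")  # prints "86% of positive tests are false"
```

Per 10,000 people screened, that is 495 false positives against 80 true positives.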

Notice, though, that it’s possible to calculate the disastrous false-positive rate for screening tests only because we have estimates for the prevalence of the condition in the whole population being tested. This is the prior probability that we need to use Bayes’s theorem. If we return to the problem of tests of significance, it’s not so easy. The analogue of the prevalence of disease in the population becomes, in the case of significance tests, the probability that there is a real difference between the pills before the experiment is done – the prior probability that there’s a real effect. And it’s usually impossible to make a good guess at the value of this figure.

An example should make the idea more concrete. Imagine testing 1,000 different drugs, one at a time, to sort out which works and which doesn’t. You’d be lucky if 10 per cent of them were effective, so let’s proceed by assuming a prevalence or prior probability of 10 per cent. Say we observe a ‘just significant’ result, for example, a *P* = 0.047 in a single test, and declare that this is evidence that we have made a discovery. That claim will be wrong, not in 5 per cent of cases, as is commonly believed, but in 76 per cent of cases. That is disastrously high. Just as in screening tests, the reason for this large number of mistakes is that the number of *false* positives in the tests where there is no real effect outweighs the number of *true* positives that arise from the cases in which there is a real effect.

In general, though, we don’t know the real prevalence of true effects. So, although we can calculate the *p-*value, we can’t calculate the number of false positives. But what we can do is give a *minimum* value for the false positive rate. To do this, we need only assume that it’s not legitimate to say, before the observations are made, that the odds that an effect is real are any higher than 50:50. To do so would be to assume you’re more likely than not to be right before the experiment even begins.

If we repeat the drug calculations using a prevalence of 50 per cent rather than 10 per cent, we get a false positive rate of 26 per cent, still much bigger than 5 per cent. Any lower prevalence will result in an even higher false positive rate.
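The essay's 76 and 26 per cent figures concern results with *P* close to 0.047, not all results with *P* below 0.05, so they cannot be reproduced by simply counting 5 per cent of the null cases. The sketch below is a rough approximation of that kind of calculation, not the author's own method. It assumes a two-sided test on a normally distributed statistic and a power of 80 per cent at the 0.05 level; both assumptions, and all the names in the code, are introduced here for illustration.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()


def false_positive_risk(p_observed, prior, power=0.8, alpha=0.05):
    """Approximate chance that a result with exactly this p-value is a false positive.

    Assumptions (illustrative, not from the essay): the test statistic is a
    standard normal z under the null, and under a real effect it is shifted
    by just enough to give the stated power at significance level alpha.
    """
    # z-score whose two-sided tail area equals the observed p-value.
    z_obs = STD_NORMAL.inv_cdf(1 - p_observed / 2)
    # Shift of the z distribution when the effect is real.
    shift = STD_NORMAL.inv_cdf(1 - alpha / 2) + STD_NORMAL.inv_cdf(power)

    # Density of observing |z| = z_obs with no real effect ...
    density_null = 2 * STD_NORMAL.pdf(z_obs)
    # ... and with a real effect of the assumed size.
    density_real = STD_NORMAL.pdf(z_obs - shift) + STD_NORMAL.pdf(z_obs + shift)

    # Bayes's theorem: weight each density by its prior probability.
    weight_null = (1 - prior) * density_null
    weight_real = prior * density_real
    return weight_null / (weight_null + weight_real)


risk_at_10 = false_positive_risk(0.047, prior=0.1)
risk_at_50 = false_positive_risk(0.047, prior=0.5)
print(f"prior 10%: {risk_at_10:.0%}   prior 50%: {risk_at_50:.0%}")
```

Under these assumptions the sketch lands in the same neighbourhood as the essay's figures rather than matching them exactly, and it shows the same pattern: the lower the prior probability of a real effect, the higher the risk that a ‘just significant’ result is a false positive.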

The upshot is that, if a scientist observes a ‘just significant’ result in a single test, say *P* = 0.047, and declares that she’s made a discovery, that claim will be wrong at least 26 per cent of the time, and probably more. No wonder then that there are problems with reproducibility in areas of science that rely on tests of significance.

What is to be done? For a start, it’s high time that we abandoned the well-worn term ‘statistically significant’. The cut-off of *P* < 0.05 that’s almost universal in biomedical sciences is entirely arbitrary – and, as we’ve seen, it’s quite inadequate as evidence for a real effect. Although it’s common to blame Fisher for the magic value of 0.05, in fact Fisher said, in 1926, that *P* = 0.05 was a ‘low standard of significance’ and that a scientific fact should be regarded as experimentally established only if repeating the experiment ‘*rarely fails* to give this level of significance’.

The ‘rarely fails’ bit, emphasised by Fisher 90 years ago, has been forgotten. A single experiment that gives *P* = 0.045 will get a ‘discovery’ published in the most glamorous journals. So it’s not fair to blame Fisher, but nonetheless there’s an uncomfortable amount of truth in what the physicist Robert Matthews at Aston University in Birmingham had to say in 1998: ‘The plain fact is that 70 years ago Ronald Fisher gave scientists a mathematical machine for turning baloney into breakthroughs, and flukes into funding. It is time to pull the plug.’

The underlying problem is that universities around the world press their staff to write whether or not they have anything to say. This amounts to pressure to cut corners, to value quantity rather than quality, to exaggerate the consequences of their work and, occasionally, to cheat. People are under such pressure to produce papers that they have neither the time nor the motivation to learn about statistics, or to replicate experiments. Until something is done about these perverse incentives, biomedical science will be distrusted by the public, and rightly so. Senior scientists, vice-chancellors and politicians have set a very bad example to young researchers. As the zoologist Peter Lawrence at the University of Cambridge put it in 2007:

hype your work, slice the findings up as much as possible (four papers good, two papers bad), compress the results (most top journals have little space, a typical *Nature* letter now has the density of a black hole), simplify your conclusions but complexify the material (more difficult for reviewers to fault it!)

But there is good news too. Most of the problems occur only in certain areas of medicine and psychology. And despite the statistical mishaps, there have been enormous advances in biomedicine. The reproducibility crisis is being tackled. All we need to do now is to stop vice-chancellors and grant-giving agencies imposing incentives for researchers to behave badly.

David Colquhoun is a professor of pharmacology at University College London and a Fellow of the Royal Society. He is the author of *Lectures on Biostatistics* (1971) and blogs at DC’s Improbable Science.

aeon.co

Over the weekend, I came across a compelling essay about the incredibly complex relationships that women raped during war have with the children born of that violent act.

The essay, which was published August 18 in the journal The Lancet, was written by two Dutch psychologists, Elisa van Ee and Rolf J. Kleber, who have worked with victims of rape from war-torn countries.

“Clinical case reports describe a high rate of ambivalent parent-child relationships or even abusive relationships and a high rate of serious discrimination within the societies in which these children are raised,” they write. “National difficulties have serious consequences for the child, who might experience attachment disorders, disturbances in psychosocial development, and identity issues.”

And this: “These children are generally regarded with disdain by their communities — they are referred to by such names as ‘devil’s children’ in Rwanda, ‘children of shame’ in Timor-Leste, ‘monster babies’ in Nicaragua, ‘dust of life’ in Vietnam, or ‘Chetnik babies’ in Bosnia-Herzegovina. [The World Health Organization] has described children born of rape as at risk of being neglected, stigmatized, ostracized, or abandoned. Cases of infanticide have also been reported. Despite such general concerns, little is known about the fate of these children.”

Van Ee and Kleber tell a moving story of Arya, a woman who was raped by soldiers during a 2003 coup d’etat in the Central African Republic, and her 5-year-old son, Anselme, born nine months after the rape: “Arya described her complex feelings for her son: she loved him because he was her own blood, but she also hated him since he resembled the rapists. She told us how sometimes she was tender towards Anselme, yet on other occasions she was harsh and wanted to beat the rapist part out of him. Most of the time, though, she just did not notice him so consumed was she with her own memories and sorrow.”

Unfortunately, as the psychologists acknowledge at the end of their essay, Arya is a composite of different women they have seen and treated. So although the story's essence may be true, it lacks the force of a fully factual account of an individual woman. (That’s why journalists do not use composites in their reporting.)

**Always complicated**

Yet the ambivalent feelings of “Arya” toward her son are similar to what writer Andrew Solomon details in a New Yorker essay he wrote a few weeks ago in the aftermath of Rep. Todd Akin’s (R-Mo.) now infamous comments on “legitimate” rape. Solomon interviewed women in the United States who are raising children conceived in rape for his latest book, “Far from the Tree: Parents, Children and the Search for Identity,” which is scheduled to be published in November.

“The aftermath of rape is always complicated,” he writes.

Many victims are simply in denial that they are pregnant in the first place: a full third of the pregnancies resulting from rape are not discovered until the second trimester. Any delay in detection reduces women’s options, especially outside major urban centers, but many women struggle with the speed of the decision; they are still recovering from being raped when they are called on to make up their minds about an abortion.

The decision of whether or not to carry through with such a pregnancy is nearly always an ordeal that can lead, no matter which choice is ultimately made, to depression, anxiety, insomnia, and P.T.S.D. Rape is a permanent damage; it leaves not scars, but open wounds. As one woman I saw said, “You can abort the child, but not the experience.”

Even women who try to learn their child’s blamelessness can find it desperately difficult. The British psychoanalyst Joan Raphael-Leff writes of women bearing children conceived in rape, “The woman feels she has growing inside her part of a hateful or distasteful Other. Unless this feeling can be resolved, the fœtus who takes on these characteristics is liable to remain an internal foreigner, barely tolerated or in constant danger of expulsion, and the baby will emerge part-stranger, likely to be ostracized or punished.”

One rape survivor, in testimony before the Louisiana Senate Committee on Health and Welfare, described her son as “a living, breathing torture mechanism that replayed in my mind over and over the rape.” Another woman described having a rape-conceived son as “entrapment beyond description” and felt “the child was cursed from birth”; the child ultimately had severe psychological challenges and was removed from the family by social services concerned about his mental well-being. One of the women I interviewed said, “While most mothers just go with their natural instincts, my instincts are horrifying. It’s a constant, conscious effort that my instincts not take over.”

The rape exception in abortion law is so much the rule that many women who wish to keep children conceived in rape describe an intense social pressure to abort them, and the pressure to abort can be as sinister as the restriction of access to abortion. There can be no question that, for some women, an abortion would be far more traumatic than having a rape-conceived child.

I read the harrowing autobiography of a girl who was put under involuntary anesthesia to have an abortion of the pregnancy that had occurred when her father raped her, so that her parents could keep their reputation intact. It’s a horrifying story because the abortion clearly constitutes yet another assault: it is about a lack of choice. But ready access to a safe abortion facility allows a woman who keeps a child conceived in rape to feel that she is making a conscious decision, while having the baby because she has no choice perpetuates the trauma and is bad for the child.

Rape is, above all other things, non-volitional for the victim, and the first thing to provide a victim is control. Raped women require unfettered choice in this arena: to abort or to carry to term, and, if they do carry to term, to keep the children so conceived or to give them up for adoption. These women, like the parents of disabled children, are choosing the child over the challenging identity attached to that child.

The key word in that sentence is “choosing.”

You can read Solomon’s essay, "The Legitimate Children of Rape," on the New Yorker's website. The essay by van Ee and Kleber, "Child in the Shadowlands," is also available in full on the Lancet’s website.
