Scientific Failures


For us humans it is important not to fixate on the chaos of cause and effect. If we see someone eat a plant and get sick, we shouldn’t eat it. If someone seems to get sick when they are cold, we should stay warm. This kept our ancestors alive. It is also why, until a couple of centuries ago, Europeans and American settlers thought tomatoes were poisonous, and why people still tell you to stay warm so you don’t ‘catch cold.’

It seems likely that humans didn’t evolve to be scientists. We evolved to survive, and our most basic model is a simple iterative cost vs. benefit analysis. This has resulted in incredible discoveries. Native Americans have used tea from birch trees, which contains vitamin C, to prevent scurvy since prehistory. Making that connection probably took a while, and involved some luck, but it was extraordinary. In Europe, on the other hand, the earliest record of a solution to scurvy was a British explorer recommending orange and lime juice in 1593. Despite this, there were tons of competing theories, most of which didn’t work. In the early 18th century over a hundred thousand men in the British navy died of scurvy, and the Navy doctors wouldn’t suggest limes because citrus did not conform to their theories of disease. Then, finally, in 1753 the British physician James Lind published the results of a clinical trial that more or less settled the issue.

Looking back, it seems obvious that they should have solved it sooner, and if they had had a more developed philosophy of science they would have. But there were hilariously challenging confounding factors to work through. Fresh citrus cured scurvy, but juice that had been exposed to copper tubing and light didn’t. Fresh meat contained vitamin C as well, but salted meat did not. Improved nutrition in general prevented scurvy. If you were trying to figure this out you might notice that citrus juice is not helping, but it’s going to be a few centuries before the periodic table of elements is even invented, so you have no conception of how substances are in fact composed of smaller molecules, vitamins among them.

Then you theorize that you need fresh produce, but fresh meat prevents or cures scurvy in your crew, so that is no longer a convincing argument. And over thousands of voyages, someone might eat one of many different foods with vitamin C and be cured, and now you have everything they did or ate in the past few days as a potential solution. So everyone starts developing folk theories about how to prevent scurvy. And what is really comical is that the new scientific view, formed in the wake of the germ theory of disease, suggested that scurvy was caused by bacteria in tainted or old meat. So if you were a scientist invested in the germ theory of disease, you might not be too keen on evidence that seems to go against your scientific argument.


Science journalism seems to be growing in popularity. On my daily commute I listen to NPR and hear the newest social science research. Papers on popular or controversial issues are quickly distilled and find their way into major publications, such as the New York Times. There are even new platforms, like Vox, which claim to take a scientific and analytical approach to the news. The articles are usually about inequality, gender or race, labor economics, and a whole bunch of sci-fi space junk. They usually play up the authority of the scientists, and take their findings at face value.

I know a lot of smart people who read and share these articles on Facebook and LinkedIn, and even talk about them at work. Pointing out methodological flaws, or telling people you don’t believe them when they talk about interesting research they heard about, isn’t something you should do if you enjoy having friends.

The abuse of research design and statistical methods is what lets most awful research take on a veneer of authority. Fishing for significance is the most common error, as p-values are the bread and butter of modern statistical inference. If we estimate a parameter of 8.5 with a p-value of 5%, it means that if the parameter we are estimating were really 0 (the null hypothesis), there would only be a 5% probability of seeing an estimate at least as extreme as 8.5. One major problem with this is explained by Andrew Gelman, who co-wrote a great paper that touches on one of the main issues here, The Difference Between “Significant” and “Not Significant” is not Itself Statistically Significant. The point is basically that a p-value moving from 4.9% to 5.1% isn’t actually a significant movement, even though a 5% p-value is often viewed in the peer-review process as ‘proof’ that an effect exists.

There are additional statistical issues with significance. For example, the assumption is usually that the hypothesis we are testing our parameter against is ‘no effect’ (i.e. zero). But this depends on the circumstances, and is not always appropriate. Then there are also concerns about the size of the parameter. If you wanted to measure differences in height within the US, and cut the country into two equal halves, the difference in height would be statistically significant no matter how tiny it was. After all, we are dealing with the whole population, so as long as the two averages aren’t equal, they are statistically significant by definition (our standard errors are zero).
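Both points can be checked with a few lines of arithmetic. The sketch below is only an illustration: it uses a normal approximation, assumes the two estimates being compared are independent with equal standard errors, and the height numbers (a 0.1 cm gap, a 10 cm standard deviation) are made up.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal, used for every test below


def two_sided_p(z: float) -> float:
    """Two-sided p-value for a z-statistic under a standard normal null."""
    return 2 * (1 - STD_NORMAL.cdf(abs(z)))


def z_for_p(p: float) -> float:
    """Positive z-statistic whose two-sided p-value equals p."""
    return STD_NORMAL.inv_cdf(1 - p / 2)


def p_for_difference(p_a: float, p_b: float) -> float:
    """p-value for the gap between two independent estimates.

    Assumption: both estimates have the same standard error and the same
    sign, so the gap between their z-statistics has standard error sqrt(2).
    """
    z_gap = (z_for_p(p_a) - z_for_p(p_b)) / sqrt(2)
    return two_sided_p(z_gap)


def p_for_mean_gap(gap: float, sd: float, n_per_group: int) -> float:
    """p-value for a difference in means between two equally sized groups."""
    standard_error = sd * sqrt(2 / n_per_group)
    return two_sided_p(gap / standard_error)


# A result at p = 4.9% versus one at p = 5.1%: is the gap itself significant?
threshold_gap_p = p_for_difference(0.049, 0.051)

# The same made-up 0.1 cm height gap, tested on 100 people and on a million.
small_sample_p = p_for_mean_gap(gap=0.1, sd=10.0, n_per_group=100)
large_sample_p = p_for_mean_gap(gap=0.1, sd=10.0, n_per_group=1_000_000)

print(f"4.9% vs 5.1%: p for the difference = {threshold_gap_p:.3f}")
print(f"0.1 cm gap, n = 100 per group: p = {small_sample_p:.3f}")
print(f"0.1 cm gap, n = 1,000,000 per group: p = {large_sample_p:.2e}")
```

Under these assumptions the two "different" results are statistically indistinguishable from each other, while a gap nobody would care about becomes overwhelmingly significant once the sample is large enough.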

Those two issues are the most commonly cited when criticizing modern science, but in my view they derive from a more insidious issue. When fitting a model there are usually thousands of plausible specifications to choose from. It’s easy to test tons of model variations until a p-value sticks out, and then create a great story about why this is the optimal model. For example, there is a new paper out claiming that an antidepressant, Paxil, can increase the risk of suicide in teenagers. This paper uses the same data that was used in the drug trial that concluded Paxil was safe, but comes to a different conclusion: that Paxil has side effects that were ignored in the original study. The original paper argued Paxil was safe, and the statistical evidence in it did not suggest an increase in suicide risk. It’s not surprising this is hard to measure, as suicide is very rare, which means you might only have a few patients reporting suicidal thoughts, and probably zero patients who commit suicide.

The data was from 1994 to 1998, and focused on 275 adolescents with major depression that had lasted at least eight weeks. There was a double-blind treatment with paroxetine (Paxil), imipramine, or a placebo. The study had an eight-week randomized controlled trial, followed by a six-month continuation phase. As in most antidepressant trials, the main outcome variable was a survey called HAM-D, which indexes depression from 0 (none) to 52 (extremely suicidal due to depression). There are many assumptions about how to interpret this, which you can read in the paper if you’re interested.

The criticisms in this paper appear to be somewhat justified. The original paper made a series of choices when recording and reporting the data, each of which would be plausible on its own, but which when combined suggest that, whether by luck or design, the data was presented in a way that slightly understates adverse effects. There were two points the new paper makes that I found most compelling. The first was that the original study only reported an adverse effect if it appeared in more than 5% of the sample, but went on to create very specific categorizations. For example, anxiousness, nervousness, and agitation could each reach only 4% of the sample, but it could be argued that these are different words for the same symptom. The second was that the original authors made access to their data and documentation extremely challenging. For such important research this is unacceptable, and open access to the data should be required with publication.

The main and most popular finding in this paper had to do with adolescents being at higher suicide risk than originally thought, so let’s look into this. Using their new methodology, they found that five patients dropped out due to suicidality, whereas the original paper had that metric as zero. This new methodology also had three patients drop out due to suicidality in the placebo group, which was also originally zero. Based on patient documentation they also noted that there were 11 suicidal patients during the acute phase and taper, compared to 5 suicidal patients in the original study (although the first number includes the taper phase, which the original study didn’t include). Throughout the entire study one patient unsuccessfully attempted suicide.

The difference between the two papers can be explained by the garden of forking paths, sometimes also called researcher degrees of freedom. The concept describes how many different ways a researcher can analyze the same data to reach the desired result. In these two papers, the authors of the original paper would benefit more from supporting the medicine, as they were employed by the drug company. The authors of the second paper would benefit more from finding a severe flaw to support their argument and get a great publication (and to their credit, they admit this in their paper).
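To get a feel for how much room the forking paths leave, here is a rough simulation, not a reconstruction of either Paxil analysis. It assumes a hypothetical researcher with pure-noise data (the true effect is zero) who can try several independent specifications and report only the best one. Real specifications run on one dataset are correlated, so the actual inflation would be smaller than this, but the direction is the same.

```python
import random
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()


def mean_and_variance(xs):
    """Sample mean and unbiased sample variance."""
    m = sum(xs) / len(xs)
    var = sum((x - m) ** 2 for x in xs) / (len(xs) - 1)
    return m, var


def p_value_two_groups(a, b):
    """Two-sided p-value from a large-sample z-test on a difference in means."""
    mean_a, var_a = mean_and_variance(a)
    mean_b, var_b = mean_and_variance(b)
    z = (mean_a - mean_b) / sqrt(var_a / len(a) + var_b / len(b))
    return 2 * (1 - STD_NORMAL.cdf(abs(z)))


def best_p_over_specs(rng, n_specs, n_per_group=40):
    """Smallest p-value across n_specs analyses where the true effect is zero.

    Assumption: each 'specification' is modeled as a fresh, independent
    noise-only comparison between a treated and a control group.
    """
    best = 1.0
    for _ in range(n_specs):
        treated = [rng.gauss(0, 1) for _ in range(n_per_group)]
        control = [rng.gauss(0, 1) for _ in range(n_per_group)]
        best = min(best, p_value_two_groups(treated, control))
    return best


def false_positive_rate(n_specs, n_studies=400, seed=0):
    """Share of noise-only studies that can report at least one p < 0.05."""
    rng = random.Random(seed)
    hits = sum(best_p_over_specs(rng, n_specs) < 0.05 for _ in range(n_studies))
    return hits / n_studies


one_spec_rate = false_positive_rate(n_specs=1)
twenty_spec_rate = false_positive_rate(n_specs=20)

print(f"One pre-chosen specification: {one_spec_rate:.0%} of studies 'find' an effect")
print(f"Best of twenty specifications: {twenty_spec_rate:.0%} of studies 'find' an effect")
```

With a single specification chosen in advance the false positive rate stays near the nominal 5%. With twenty independent tries, the chance that at least one clears the bar is roughly 1 - 0.95^20, or about 64%, even though there is nothing to find.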

Based on the replication’s main analysis, their biggest complaint is that the first paper understates suicidality, along with other minor issues. But this paper doesn’t find the smoking gun they claim. The truth seems to be one of differences in coding. Imagine two psychiatrists who each meet with the same 80 severely depressed patients over eight weeks. At the end one says “I think about five of them were low-risk suicidal” (because remember, high-risk suicide requires being committed). The second says “I disagree, I think 11 of them were suicidal.” They then sit and compare notes, and it turns out they look for slightly different signals and indicators. One of them is really conservative and documents anything that could be perceived as suicidal, and the other takes a pragmatic approach.

The statistical power here, the likelihood of detecting an effect when there is an effect to be detected, is very low for rare events. If one in a thousand users of Paxil kills themselves due to the drug, this study wouldn’t have much chance of detecting it. Not to mention that this study took place about 20 years ago. Since then millions of adolescents have taken Paxil. While that data might be harder to find, and isn’t from a randomized study, it has a sample size in the millions. I do not think quibbles about the classification of a few people out of a sub-sample of 80 from 20 years ago should hold that much weight, although I could be wrong as I haven’t worked in this field.
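A back-of-the-envelope calculation makes the point. This sketch assumes each patient independently has the same chance of the event, uses the one-in-a-thousand rate from the paragraph above, and asks only how likely a trial is to see even a single case, which is a more generous standard than detecting an excess over placebo.

```python
from math import ceil, log


def prob_at_least_one(rate: float, n_patients: int) -> float:
    """Chance of observing one or more events among n_patients.

    Assumption: events are independent and every patient has the same rate.
    """
    return 1 - (1 - rate) ** n_patients


def patients_needed(rate: float, target: float = 0.8) -> int:
    """Smallest sample in which at least one event shows up with probability >= target."""
    return ceil(log(1 - target) / log(1 - rate))


RATE = 1 / 1000  # the hypothetical one-in-a-thousand rate from the text

p_seen_in_80 = prob_at_least_one(RATE, 80)
n_needed = patients_needed(RATE, target=0.8)

print(f"Chance an 80-patient arm sees even one case: {p_seen_in_80:.1%}")
print(f"Patients needed for an 80% chance of seeing one case: {n_needed}")
```

Under these assumptions an arm of 80 patients would most likely see no cases at all, and it would take well over a thousand patients just to make a single case probable.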

In each case, though, there are many reasonable choices that could tilt the results slightly for or against the drug’s safety. This gets at the reason I was skeptical of both papers’ strong claims of safety or danger. The truth is I don’t think they know to the extent they claim. Both research papers are important, as together they give us a reasonable profile of the risks and benefits of Paxil. But when the argument is over the margin of an extra few people being suicidal, it’s hard to take it seriously, as the variance of the research design itself is much bigger than the change in effect.


Karl Popper argues that our reasons for coming up with a hypothesis or question exist outside of a scientific framework and are unimportant, but once hypotheses appear they must be tested rigorously and properly. As a strict philosophy of science this makes sense, since human curiosity is capricious. Unfortunately for the pure philosophy of science, most academic and private sector researchers clearly benefit from proving their hypothesis correct. When someone asks a question, there is usually an answer they either want to be true, or one they think is true and want to prove in order to confirm their intuition. There is a famously bad paper on whether women are more likely to wear pink or red when they are fertile. Why did they ask that question? I’m guessing that their thought process went something like “Women wear red and pink to embrace their femininity, since society views them as feminine colors. I bet when women are most fertile they subconsciously act more feminine to attract the attention of males. I should explore this!”

There are a few problems immediately. The most obvious one is that the researcher clearly wants the answer to support the hypothesis; otherwise there is no fun, quirky research paper that gets published to widespread science journalism acclaim. The second is that a positive result will justify their brilliant intuition, earn them respect, and advance their career. The third is that there are many different ways to test this hypothesis, both in how the original question is framed and in the model specification. The same scientific question could be addressed by examining how much skin women show, cleavage, makeup, time spent talking to men, and so on. They would all try to measure the same phenomenon. Once any of those is chosen, there are many different ways to set up the research design, collect data, and fit a model to the data. It’s so easy to support your hypothesis when you have such wide freedom, since you only need to find one approach that supports it and can ignore the rest.

None of this is reassuring for the scientific method. There aren’t clear rules on how to set up the right design outside of a randomized experiment. In this instance the question and data do not seem rooted in a robust method. Part of this is also because I view the subtleties of human behavior as usually hard to tease out from the daily noise and complexity of our world.

If all these scientific issues are known (I certainly didn’t come up with them), why do they persist? I think it is because even though philosophers of science and some statisticians are extremely interested in them, most other academics don’t appreciate the complexity of reality. Trying to understand all the chaos that we can’t understand is strange, but it is necessary to have a measure of our uncertainty, which is the heart of Deborah Mayo’s seminal research on the philosophy of error statistics. I recently watched a YouTube video of a 9/11 ‘truth’ conference, created by an organization of engineers. The presenters were mathematicians, engineers, and other PhDs and academics. They created computational simulations of the towers collapsing, presented chemical experiments showing reactions between steel beams and thermite, and generally had a deep and impressive knowledge of structural physics. I know very little about their fields, but I know they are wrong. The world is full of emergent properties on a scale we probably can’t comprehend. Even if they are much better at mathematical models than I am, my conception of omitted variable bias is better, even though all I’m doing is appealing to the complexity of the world. Even brilliant men make this mistake. Alan Turing in 1950 made the following claim:

I assume that the reader is familiar with the idea of extra-sensory perception, and the meaning of the four items of it, viz. telepathy, clairvoyance, precognition and psycho-kinesis. These disturbing phenomena seem to deny all our usual scientific ideas. How we should like to discredit them! Unfortunately the statistical evidence, at least for telepathy, is overwhelming.

Turing was a defining genius in human history who focused on math, computers, and cryptography, which are inherently logical structures fully founded on their base properties. He bought into the poorly designed research on telepathy that found statistical significance, and felt there was no choice but to accept it as scientific fact.

Linus Pauling founded quantum chemistry and molecular biology, and won the Nobel Prize in Chemistry. He later claimed vitamin C could cure cancer based on a reasonable hypothesis, and nothing could change his mind. He was convinced it was the case. If you think he was just crazy in his old age, then you need to explain why, despite being completely refuted, it’s still common knowledge that vitamin C cures colds (although these days it’s zinc, based on new bad research).


This all ties back to modelling. It becomes easy to let the strange and unpredictable emergent properties and chaos of the world drop out. Since we can’t observe them, and we don’t know how they bias our model, it is difficult to understand how our model of the world is wrong. Randomization can do a great job of fixing this, but it is usually impossible to implement. By conceptualizing the world through science experiments we have made incredible progress. If we were able to send our knowledge of the scientific method back to 16th century Britain, but no additional knowledge, they would probably have been able to set up a series of tests on different boats with clever use of controls, and find a solution to scurvy within a year.

I think if we were similarly able to receive only the knowledge of the scientific method from 1,000 years in the future, we could also make a leap in understanding how to set up and learn from research designs on issues from drug research to microeconomics. That is the optimistic view. The pessimistic view is that we already know far more about the proper use of the scientific method than is applied in academic research, even at the highest levels, since the truth often does not line up with passing a drug trial, being published, or getting tenure.