The use of probability values for testing “significance” of statistical evidence is often misunderstood and misused, leading to incorrect conclusions. Sunil Gupta discusses the role of p-values in reproducibility.
Anyone who has taken an introductory course in statistics has probably learnt about hypothesis testing and p-values (also called probability values), which are used to test the “significance” of statistical evidence. Unfortunately, the p-value is one of the most misunderstood and misused concepts in statistics, leading to incorrect conclusions in the published literature. This has played a significant role in the reproducibility crisis in science.
According to a survey published in the journal Nature in 2016, more than 70% of researchers have tried and failed to reproduce another scientist's experiments.
The problem has many causes: missing or under-reported experimental design assumptions, bad experimental design, insufficient data, and other violations of good experimental design practice. In cancer research alone, the resulting waste from this crisis in the U.S. amounts to $28 billion a year, according to a recent article in Slate, “Cancer Research is Broken”.
A p-value is the probability, under a specified statistical model, that a statistical summary of the data (e.g., the difference in sample means between two groups) would be equal to or more extreme than its observed value. A null hypothesis may postulate that there is no difference. A small p-value would then indicate that the data is incompatible with the null hypothesis, provided all assumptions underlying the statistical model hold.
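The definition above can be made concrete with a small sketch. It assumes an exact permutation test as the statistical model: under the null hypothesis of no difference, group labels are interchangeable, so the p-value is the fraction of relabelings whose difference in sample means is at least as extreme as the observed one. The function name and the measurements are hypothetical and chosen only for illustration.

```python
from itertools import combinations
from statistics import mean


def permutation_p_value(group_a, group_b):
    """Two-sided exact permutation p-value for a difference in sample means."""
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    observed = abs(mean(group_a) - mean(group_b))

    total = 0
    at_least_as_extreme = 0
    # Enumerate every way of assigning n_a of the pooled values to group A.
    for idx in combinations(range(len(pooled)), n_a):
        in_a = set(idx)
        a = [pooled[i] for i in idx]
        b = [pooled[i] for i in range(len(pooled)) if i not in in_a]
        total += 1
        # Small tolerance guards against floating-point ties.
        if abs(mean(a) - mean(b)) >= observed - 1e-12:
            at_least_as_extreme += 1
    return at_least_as_extreme / total


# Hypothetical measurements, not taken from any study.
treated = [12.1, 13.4, 11.8, 14.2, 13.0]
control = [10.2, 11.1, 10.8, 11.5, 10.9]

p = permutation_p_value(treated, control)
print(f"p-value = {p:.4f}")
```

Here the small p-value says only that data this extreme would be rare if the labels were truly interchangeable. It says nothing about how large or important the difference is.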
There are at least two key challenges:
Some research studies incorrectly interpret p-values to imply the reverse – the probability that our hypothesis is true given the observed data. However, the two are not the same. That is,
P(Hypothesis | Observed Data) ≠ P(Observed Data | Hypothesis)
A p-value on its own cannot be used to conclude that a hypothesis is true.
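A short calculation with Bayes' rule shows how far apart the two conditional probabilities can be. This is a sketch under stated assumptions: the prior share of true null hypotheses (90%) and the test's power (80%) are hypothetical values picked for illustration, and the function name is made up.

```python
def prob_null_given_significant(prior_null, alpha, power):
    """P(null is true | significant result), computed with Bayes' rule."""
    p_sig_and_null = alpha * prior_null            # false positives
    p_sig_and_effect = power * (1.0 - prior_null)  # true positives
    return p_sig_and_null / (p_sig_and_null + p_sig_and_effect)


alpha = 0.05       # P(significant result | null is true)
prior_null = 0.90  # assumed: 90% of tested hypotheses have no real effect
power = 0.80       # assumed: P(significant result | real effect)

posterior = prob_null_given_significant(prior_null, alpha, power)
print(f"P(significant | null)  = {alpha:.2f}")
print(f"P(null | significant)  = {posterior:.2f}")
```

Under these assumed inputs, a result that is "significant at 0.05" still leaves a much larger than 5% chance that the null hypothesis is true. With different priors or power the number changes, which is exactly why the p-value alone cannot supply it.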
If the threshold for significance is p-value <= 0.05 and the null hypothesis is actually true, then over many uses of the test across different studies you would reject the null hypothesis 5% of the time and be in error. This leads to the second main issue: statistical significance may be observed simply because many conditions are collected and analyzed, but only those with a p-value less than 0.05 are reported.
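Both effects can be seen in a small simulation. The sketch below assumes a two-sided z-test with known standard deviation and generates every data set with the null hypothesis true, so any "significant" result is a false positive. The study counts, sample sizes and function names are hypothetical choices for illustration.

```python
import random
from statistics import NormalDist


def z_test_p_value(sample, sigma=1.0):
    """Two-sided p-value for H0: mean = 0, with known standard deviation."""
    z = (sum(sample) / len(sample)) * len(sample) ** 0.5 / sigma
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))


def simulate(n_studies, n_conditions, n_obs, alpha, rng):
    """Run studies in which the null is always true; count false positives."""
    significant_tests = 0
    studies_with_a_hit = 0
    for _ in range(n_studies):
        hits = 0
        for _ in range(n_conditions):
            sample = [rng.gauss(0.0, 1.0) for _ in range(n_obs)]
            if z_test_p_value(sample) <= alpha:
                hits += 1
        significant_tests += hits
        if hits > 0:
            studies_with_a_hit += 1
    per_test_rate = significant_tests / (n_studies * n_conditions)
    per_study_rate = studies_with_a_hit / n_studies
    return per_test_rate, per_study_rate


rng = random.Random(0)  # fixed seed so the run is repeatable
per_test, per_study = simulate(
    n_studies=2000, n_conditions=20, n_obs=30, alpha=0.05, rng=rng
)
print(f"share of individual tests with p <= 0.05:       {per_test:.3f}")
print(f"share of studies with at least one such result: {per_study:.3f}")
```

The per-test rate should come out near 5%, as expected. The per-study rate should land near 1 - 0.95^20, or about 64%: a study that tries 20 conditions and reports only the ones below 0.05 will usually have something "significant" to report even when no real effect exists.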
The statistics community has recently issued new guidelines to warn scientists and researchers about the pitfalls of p-values and to suggest other ways to improve analysis, such as confidence intervals, effect sizes, and full reporting of the underlying experimental conditions (see http://dx.doi.org/10.1080/00031305.2016.1154108 for a statement published in the journal “The American Statistician”).
We are already beginning to see the positive impact on reproducibility.
Sunil Gupta is the Director of Product Innovation at IEEE and has over 25 years of experience leading research in speech & signal processing, hardware/software product development, embedded systems, startups, finance, portfolio analytics and education.