*by Anne Kessler*

Upon completing my first year of graduate coursework, I found myself where no student wants to be – in a summer class. Per the stipulations of my training grant, I was enrolled in the Clinical Research Training Program’s *summer intensive* with a single goal in mind (*pass*) and very little appreciation for all that I would learn in the subsequent three months.

On the first day of introductory biostatistics, Einstein’s own Dr. Hillel Cohen discussed the use, misuse, and abuse of p-values in the scientific community. He asked our class, which was composed of both students and clinicians, to provide a good definition for the term ‘p-value.’ He was not at all surprised when the best response we could provide was what he called “a popular but entirely incorrect” definition of what a p-value actually is. For me, this raised feelings of concern and terror – **How was it possible that a diverse room of researchers, all of whom use statistical methods to confirm that their research findings are relevant and significant, was incapable of collectively defining what a p-value is?** At the end of the first lesson, Dr. Cohen wrote the definition of a p-value on the board: “A p-value is the calculated, estimated probability of observing the test statistic (or results more extreme) given that (1) the null hypothesis is true, (2) the assumptions of the test are not meaningfully violated, and (3) there is no systematic or differential random error.”
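To make the definition concrete, here is a minimal sketch using a toy example that is my own and not from the class: a coin assumed to be fair (the null hypothesis) is flipped a fixed number of times, and the p-value is the probability of a head count at least as far from the expected count as the one observed. The function name `fair_coin_p_value` and the coin scenario are illustrative assumptions.

```python
from math import comb


def fair_coin_p_value(heads: int, flips: int) -> float:
    """Exact two-sided p-value for a head count, assuming the coin is fair.

    Hypothetical helper for illustration: "more extreme" is taken to mean
    "at least as far from the expected count (flips / 2) as the observed count".
    """
    # Compare distances from flips / 2 in integer arithmetic (scaled by 2).
    observed_distance = abs(2 * heads - flips)
    # Count every equally likely sequence whose head count is as extreme or more.
    extreme_sequences = sum(
        comb(flips, k)
        for k in range(flips + 1)
        if abs(2 * k - flips) >= observed_distance
    )
    # Under the null, each of the 2**flips sequences is equally likely.
    return extreme_sequences / 2**flips


if __name__ == "__main__":
    # Probability of a result at least as extreme as 60 heads in 100 fair flips.
    print(fair_coin_p_value(60, 100))
```

Note what the number does and does not say: it is a probability of the data (or more extreme data) computed as if the null hypothesis were true, not the probability that the null hypothesis is true.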

**So, what does this mean – what’s in a p-value?**

The foundation of any p-value is a model referred to as a “known probability distribution.” These defined probability distributions (e.g. standard normal, t, chi-square, F, Poisson) allow us to estimate through inference the probability (p-value) that corresponds to a range of values (of the test statistic) that fall within the known distribution. Because p-values are derived from models, the validity of any calculated p-value relies on the assumptions of the corresponding statistical test being met (i.e. violating the test’s assumptions yields a p-value that is essentially meaningless regardless of its value or ‘significance’).
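As a small illustration of how a known distribution turns a test statistic into a p-value, the sketch below assumes the statistic is a z-score that follows the standard normal distribution under the null hypothesis. The helper name `two_sided_p_from_z` is my own; `NormalDist` comes from Python's standard `statistics` module.

```python
from statistics import NormalDist


def two_sided_p_from_z(z: float) -> float:
    """Two-sided p-value for a z statistic, assuming a standard normal null distribution."""
    standard_normal = NormalDist(mu=0.0, sigma=1.0)
    # Area in the upper tail beyond |z| ...
    upper_tail = 1.0 - standard_normal.cdf(abs(z))
    # ... doubled, because results as extreme in either direction count.
    return 2.0 * upper_tail


if __name__ == "__main__":
    # The familiar cutoff: |z| = 1.96 corresponds to p of roughly 0.05.
    print(round(two_sided_p_from_z(1.96), 3))
```

The arithmetic is only as trustworthy as the assumption behind it: if the statistic does not actually follow the standard normal distribution, the function still returns a number, but that number no longer means what the definition says it means.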

The requirement that the assumptions of the statistical test are not meaningfully violated is the first implied condition in a p-value, and it is ultimately two-fold. The first assumption is that the sample subjected to the statistical test was selected at random from an infinite number of potential samples and is representative of the total population under study. In the literal sense, this sounds applicable mainly to clinical studies with human subjects. Technically speaking, however, a population can be made up of an infinite number of samples of literally anything (cells, mice, etc.). This assumption calls into question the use of test statistics obtained from *in vitro* and *in vivo* studies to improve our understanding of human conditions, and it stresses the importance of using only model systems that are useful and relevant to humans in order to translate statistically significant findings from the bench to true human biology. Here, it is important to remember that a statistically significant result does not necessarily indicate that the findings are meaningful in any system other than the one used in the experiment. The second assumption is that the statistic follows a known probability distribution, which means that extensive knowledge of both our sample (cells, mice, humans, etc.) and the statistical models available is crucial in deciding which statistics and defined probability distributions are most relevant for our studies.

The second implied condition in the p-value definition is that the data were collected without substantial systematic or differential random error. Systematic error is introduced when there are flaws in the design and/or equipment used in the study. By its nature, systematic error cannot be reduced by simply increasing the ‘n’ of an experiment or study, so the best way to control for it is to (1) design the experiments thoroughly and properly and (2) run appropriate checks and controls for the duration of the study to avoid introducing technique or equipment error. The second type of error assumed to be absent from a test statistic is differential random error, which is most often introduced during sampling. This type of sampling error occurs when the sample selected comes from an extreme end or ‘tail’ of the distribution, and it results in skewed findings that meaningfully differ from the true probabilities present in the actual population. This type of random error is hard to detect, so it is important to consider when analyzing study outcomes and interpreting the validity of a given test statistic and its corresponding p-value.
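The claim that a larger ‘n’ cannot fix systematic error can be seen in a quick simulation. This is a rough sketch under assumptions of my own: a hypothetical instrument that reads 0.5 units too high on every measurement, with Gaussian random noise added on top. The function name, the true value, the bias, and the noise level are all made up for illustration, and the sketch does not attempt to model differential random error.

```python
import random
from statistics import fmean


def mean_measurement(
    n: int,
    true_value: float = 10.0,
    bias: float = 0.5,
    noise_sd: float = 2.0,
    seed: int = 0,
) -> float:
    """Average of n simulated readings from a miscalibrated instrument.

    Each reading = true value + constant bias (systematic error)
                   + Gaussian noise (ordinary random error).
    All parameter values are hypothetical.
    """
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    readings = [true_value + bias + rng.gauss(0.0, noise_sd) for _ in range(n)]
    return fmean(readings)


if __name__ == "__main__":
    # Random noise averages out as n grows; the constant bias does not.
    for n in (10, 1_000, 100_000):
        print(n, mean_measurement(n))
```

As n grows, the average settles down, but it settles near the biased value (true value plus bias) and not near the true value, which is why design, checks, and controls matter more here than sample size.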


**And magic?**

At face value, p-values are seemingly magic. Their numeric value can be the difference between significant and insignificant results, between published work and wasted time, and between making our bosses happy and hiding out at our desks to avoid an inevitable confrontation. Dr. Cohen, however, suggests that we take some time to think more deeply about what a resultant p-value actually means in the context in which it was determined, and I think we can all find a bit of confidence in understanding that “**the P value tool can be very valuable, provided that we remember that along with Probability, P can also stand for perspective – an indispensable part of keeping our interpretation of statistical results scientific and not magical**.”


*References:*

- Cohen, HW. *Am J of Hyp* **24**, 18-23 (2011).
- Albert Einstein College of Medicine CRTP Summer Intensive

*Annie got her start in science at the University of Wisconsin-Madison where she earned a BS in medical microbiology and immunology. Currently, she is a PhD candidate at Albert Einstein College of Medicine, studying pediatric cerebral malaria with a focus on host immunology and outcomes in HIV and malaria co-infection. In collaboration with the Blantyre Malaria Project, Annie travels to Malawi during the ‘rainy season’ to collect and process the patient samples required for her thesis work. Outside of the lab, she enjoys running, searching for NYC’s best cup of coffee, and rooting for the Wisconsin Badgers.*