In research, authors, journals, and readers all aim to produce, publish, and consume significant (worthwhile) content. An assumed measure of “significance” (i.e. p < .05) is therefore easily mistaken for a handy indicator. However, the .05 threshold is entirely arbitrary, and p values as such are inappropriate guides for clinical or scientific decision making. The p stands for (statistical) probability, not for (clinical) certainty; it characterises individual comparisons statistically, without clinical interpretation. Significance is thus defined very narrowly and must not be confused with the clinical relevance, generalisability, or even meaning of findings.
Two biological observations are never identical: a natural degree of variation will be seen even if the same sample is evaluated twice under identical conditions. The main challenge lies in differentiating whether an observed difference reflects such “background noise” or a real difference (attributable, for instance, to an intervention). The p value quantifies how probable a difference at least as large as the observed one would be if chance alone were at work (i.e. if there were no true difference). If this probability is very small, for instance less than 5%, the assumption of a true difference (or treatment effect) may seem justified.
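One concrete way to see what a p value measures is a permutation test. The sketch below (a minimal illustration using Python's standard library only; the sample sizes, seed, and blood-pressure-like numbers are arbitrary assumptions, not data from any study) draws two samples from the same population, so any observed difference is pure “background noise”, and then asks how often random relabelling of the pooled data reproduces a difference at least that large:

```python
import random
import statistics

random.seed(2)

def perm_p_value(a, b, n_perm=10_000):
    """Two-sided permutation test: the fraction of random relabellings
    of the pooled data whose mean difference is at least as large as
    the observed one. This fraction is the (permutation) p value."""
    observed = abs(statistics.mean(a) - statistics.mean(b))
    pooled = a + b
    hits = 0
    for _ in range(n_perm):
        random.shuffle(pooled)
        diff = abs(statistics.mean(pooled[:len(a)]) -
                   statistics.mean(pooled[len(a):]))
        if diff >= observed:
            hits += 1
    return hits / n_perm

# Two samples drawn from the SAME distribution (no true difference):
# whatever difference we observe is chance alone.
a = [random.gauss(120, 10) for _ in range(30)]
b = [random.gauss(120, 10) for _ in range(30)]
p = perm_p_value(a, b)
```

Run repeatedly with different seeds, such a comparison yields p < .05 in roughly 5% of runs, even though no true difference exists, which is exactly the residual uncertainty the threshold tolerates.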
Importantly, the 5% threshold is not absolute but rests on convention alone. There is no critical difference between p = .045 and p = .055: the estimated likelihood of a chance finding differs by just one percentage point. At other times, a residual uncertainty of 5% may be unacceptably high: a car with a known 5% risk of brake failure would hardly be licensed! A p value provides at best a crude orientation regarding the probability that a specific group difference is real, but it is far too simplistic to explain the “big (clinical) picture”.
Specifically, p values must not be mistaken for a substitute for critical appraisal, on which they are silent in many crucial respects.
(1) A p value does not indicate whether the described comparison was justified (e.g. whether the compared groups were comparable to begin with). This fundamental precondition must be ascertained through the study design. Randomised controlled experiments come closest to the ideal of unbiased study group comparability; to a certain extent, stratified or confounder-adjusted observational studies can emulate this.
(2) A p value ignores whether the selected statistical test was appropriate. The correct choice depends on the data to be assessed, the sample size, the comparative concept, and the outcome format, all of which must be checked during critical appraisal.
(3) As elaborated above, the .05 threshold leaves considerable uncertainty as to whether the assumption of a difference (or treatment effect) is, in fact, correct (alpha error). Whether additional safety margins (i.e. a lower degree of uncertainty, for instance p < .01) are needed depends on the clinical context. Conversely, as p values refer to specific samples only, a “non-significant” p > .05 does not exclude relevant effects of an intervention in clinical reality (so-called beta error): absence of proof is not proof of absence!
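Both error types can be made visible by simulation. The sketch below (an illustrative stdlib-only sketch; the effect size of 4 units, SD of 10, sample sizes, and trial counts are all assumptions chosen for demonstration) repeats a simple two-sample comparison many times. With no true effect, about 5% of comparisons are “significant” anyway (alpha error); with a real but modest effect tested in a small sample, most comparisons miss it (beta error):

```python
import math
import random

random.seed(7)

def z_test_p(a, b, sigma):
    """Two-sided two-sample z test, assuming a known common SD `sigma`
    (a simplification that keeps the example dependency-free)."""
    diff = sum(a) / len(a) - sum(b) / len(b)
    se = sigma * math.sqrt(1 / len(a) + 1 / len(b))
    z = abs(diff) / se
    return math.erfc(z / math.sqrt(2))  # two-sided normal tail probability

def significant_fraction(true_diff, n, trials=2000, sigma=10.0):
    """Fraction of simulated trials reaching p < .05."""
    hits = 0
    for _ in range(trials):
        a = [random.gauss(0.0, sigma) for _ in range(n)]
        b = [random.gauss(true_diff, sigma) for _ in range(n)]
        if z_test_p(a, b, sigma) < 0.05:
            hits += 1
    return hits / trials

alpha = significant_fraction(true_diff=0.0, n=30)  # no effect: ~5% false alarms
power = significant_fraction(true_diff=4.0, n=15)  # real effect, small sample
```

Here `power` comes out well below 50%: most of these underpowered trials return a “non-significant” p > .05 despite a genuinely effective intervention, which is why p > .05 must never be read as proof of absence.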
(4) A p value depends on the sample size: for a given difference, the larger the sample, the smaller the associated p value, and hence the greater the chance of formal significance at the 5% threshold. Remember, p values do not reflect the clinical relevance of a finding, even if the underlying difference is real. A clinically modest treatment effect may appear “significant” when tested in a large enough (“overpowered”) sample. If, for instance, a trial reported a real antihypertensive drug effect (p < .001),
the clinical decision whether to expose a patient to potential adverse effects (for the benefit of a diastolic pressure reduction of 4.4 mmHg at 8 weeks) should not be driven by the p value. Clinical relevance must be appraised with appropriate measures such as effect size (e.g. relative risk, absolute difference, or number needed to treat) together with its estimated precision (i.e. confidence intervals). The latter represent an important alternative for the assessment of statistical (and clinical) significance. Conversely, small (“underpowered”) study samples must not be used to dismiss treatment effects (see (3)): a power calculation is always required for adequate appraisal.

(5) A p value does not indicate whether the study design was predefined or the analysis plan adopted before data inspection. p values therefore ignore biased selection of patients or study periods, just as they ignore statistical fishing expeditions (i.e. multiple hypothesis testing). Consequently, exploratory findings should always be validated in hypothesis-driven investigations in different study samples.
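The appraisal recommended under (4), judging an effect by its size and precision rather than by p alone, can be sketched briefly. The 4.4 mmHg difference is taken from the text; the standard error below is a hypothetical assumption for illustration, not the trial's actual value:

```python
import math

# Hypothetical summary statistics (the 4.4 mmHg reduction is from the
# text; the standard error is an assumed, illustrative value):
diff = 4.4   # mean diastolic pressure reduction in mmHg at 8 weeks
se = 0.9     # assumed standard error of that mean difference

# 95% confidence interval via the normal approximation
ci_low = diff - 1.96 * se
ci_high = diff + 1.96 * se

# Corresponding two-sided p value
z = diff / se
p = math.erfc(z / math.sqrt(2))
```

Under these assumptions p is far below .001, yet the clinically informative result is the interval of roughly 2.6 to 6.2 mmHg: the reader can judge directly whether even the upper bound of benefit would justify the drug's adverse effects, something the p value alone cannot convey.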
(6) A p value, as such, never indicates causality. Other criteria including chronological sequence, biological plausibility, and exclusion of confounding effects must be met before a causal relationship may be assumed.
(7) A p value refers to summary statistics of specific study samples only. Applying study findings to individual patients is justified only after appraisal of their external validity (i.e. generalisability).
Clearly, p values offer valuable first orientation. However, they must be interpreted carefully against study design, sample size, comparability of study groups, and appropriateness of statistical tests, and weighed against clinical relevance. Categorisation (e.g. p < .05, p = n.s., etc.) obscures the interpretation of this continuous measure and is unacceptable. At any rate, comprehensive appraisal of scientific information must go beyond a single indicator. It is the responsibility of anyone dealing with summary statistics to ensure that study question and design, statistical approach, and presentation of results are sound before accepting or dismissing reported findings. A p value < .05 may be significant statistically, but it never proves clinical significance.
Reference
- Efficacy and safety of nebivolol and valsartan as fixed-dose combination in hypertension: a randomised, multicentre study. Lancet. 2014; 383: 1889-1898.
Article info
Published online: September 02, 2015
© 2015 European Society for Vascular Surgery. Published by Elsevier Inc.