Author: Birinder Giddey
Reviewer: Chris Nickson
Journal Club 019
Walsh M, Srinathan SK, McAuley DF, et al. The statistical significance of randomized controlled trial results is frequently fragile: a case for a Fragility Index. Journal of clinical epidemiology. 67(6):622-8. 2014. [pubmed] [free full text]
- The Fragility Index is the minimum number of patients whose status would have to change from non-event to event for a statistically significant result to become non-significant (p ≥ 0.05). Could this metric be a useful interpretive tool for assessing the “fragility” of the results of a randomised controlled trial (RCT)?
TYPE OF STUDY
- Post-hoc analysis of selected RCTs published in major medical journals
- 1273 abstracts were reviewed for eligibility
- 399 published trials met eligibility criteria
- Inclusion criteria for RCTs:
- Published in selected major medical journals: NEJM, The Lancet, JAMA, Annals of Internal Medicine and the BMJ
- Published between Jan 2004 and Dec 2010
- Parallel arm or two by two factorial design RCTs
- 1:1 ratio to intervention and control
- At least one dichotomous or time-to-event outcome reported as significant (P < 0.05 or 95% CI excluding the null value)
- Exclusion criteria
- Non-inferiority trials
- Fragility Index was calculated as follows:
- The results of each trial were represented in a two-by-two contingency table. The index was calculated by adding an event to the group with the smaller number of events (and subtracting a non-event from the same group to keep the group total constant).
- P-value re-calculated using Fisher’s Exact Test
- This was repeated until the P-value was ≥ 0.05
- The number of additional events required was called the Fragility Index
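The steps above can be sketched in a few lines of Python. This is a minimal illustration, not the study authors' code: the function names and the choice to return 0 for a result that is not significant to begin with are assumptions made here, and the two-sided Fisher's Exact Test is implemented directly from the hypergeometric distribution so that the example has no dependencies outside the standard library.

```python
from math import comb


def fisher_exact_two_sided(events_1, n_1, events_2, n_2):
    """Two-sided Fisher's Exact Test p-value for a two-by-two table.

    events_x / n_x are the number of events and the group size in each arm.
    The p-value sums the probabilities of every table with the same margins
    that is no more likely than the observed table.
    """
    total_events = events_1 + events_2
    lowest = max(0, total_events - n_2)
    highest = min(n_1, total_events)

    def weight(x):
        # Unnormalised hypergeometric probability, kept as an exact integer.
        return comb(n_1, x) * comb(n_2, total_events - x)

    observed = weight(events_1)
    as_extreme = sum(
        weight(x) for x in range(lowest, highest + 1) if weight(x) <= observed
    )
    return as_extreme / comb(n_1 + n_2, total_events)


def fragility_index(events_1, n_1, events_2, n_2, alpha=0.05):
    """Number of non-events converted to events before p >= alpha.

    Events are added to the arm with fewer events; the arm size is unchanged.
    Returns 0 if the result is not significant to begin with (an assumption
    of this sketch).
    """
    p = fisher_exact_two_sided(events_1, n_1, events_2, n_2)
    added = 0
    modify_first_arm = events_1 <= events_2
    while p < alpha:
        if modify_first_arm:
            if events_1 >= n_1:
                break  # no non-events left to convert
            events_1 += 1
        else:
            if events_2 >= n_2:
                break
            events_2 += 1
        added += 1
        p = fisher_exact_two_sided(events_1, n_1, events_2, n_2)
    return added
```

As a made-up example (not taken from any trial), `fragility_index(0, 10, 5, 10)` returns 1: with 0/10 versus 5/10 events the p-value is about 0.033, and converting a single non-event to an event in the first arm raises it to about 0.14.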
- Fragility Index was analyzed as follows:
- correlated with trial characteristics, including sample size and total number of outcomes
- Trial characteristics:
- Median sample size: 682
- Median number of events: 112
- Median Fragility Index = 8
- 25% of trials had a Fragility Index ≤ 3
- Trials with higher Fragility Index (i.e. ‘less fragile’) had:
- smaller p-values
- larger number of events
- larger sample size
- Trials with lower Fragility Index (i.e. ‘more fragile’) had:
- poor or unclear allocation concealment
CRITICISMS AND COMMENTARY
- The Fragility Index is a simple metric that encompasses important trial characteristics such as sample size and the event rate (and hence study power)
- The Fragility Index may be useful because many clinicians have little formal training in probability and statistics, and misinterpretation of P-values and confidence intervals appears to be widespread.
- Fragility Index may identify trials at high risk of ‘medical reversal’ when further studies of the same intervention are performed.
- The study authors give the example of the LIMIT-2 study published in the Lancet in 1992, which found improved 28-day survival from IV magnesium after acute MI with p=0.04. However, the Fragility Index was only 1. Three years later, ISIS-4 was published. This study had a larger sample size (58,000 patients) and a mortality benefit was no longer found.
- A paper recently published in CCM found that the Fragility Index of ICU trials is low (median of 2, with 40% of trials having a Fragility Index of 1 or less!), suggesting that much of our evidence base is weak.
- The results of a trial should be viewed with particular skepticism if the number of patients lost to follow-up exceeds the Fragility Index, because the unknown outcomes of those patients could be enough to overturn the significant result
- Fragility Index has limitations:
- as studied, it applies only to trials with 1:1 randomisation
- requires dichotomous outcomes, so it cannot be applied to continuous data
- use in time-to-event analysis may not be appropriate.
- Could this simply be another tool for clinicians to use improperly?
- Will the Fragility Index really help swing the balance of belief, or will clinicians just interpret results according to their pre-existing biases?
- Clinicians are at risk of over-interpreting the significance of RCT findings when their results hinge on the occurrence of very few events. The Fragility Index promises to be a useful tool to guard against this, by indicating the number of events required to make a statistically significant result nonsignificant.