Abstract
Background
Numerous reviews and meta-analyses of the antidepressant literature in major depressive disorder (MDD) have been published, some claiming that antidepressants are mostly ineffective and others that they are mostly effective, in either acute or maintenance treatment.
Objective
The aims of this study were to review and critique the latest and most notable antidepressant MDD studies and to conduct our own reanalysis of the US Food and Drug Administration database studies specifically analyzed by Kirsch et al.
Methods
We gathered effect estimates of each MDD study. In our reanalysis of the acute depression studies, we corrected analyses for a statistical floor effect so that relative (instead of absolute) effect size differences were calculated. We also critiqued a recent meta-analysis of the maintenance treatment literature.
Results
Our reanalysis showed that antidepressant benefit is seen not only in severe depression but also in moderate depression, and it confirmed a lack of benefit for antidepressants over placebo in mild depression. The relative antidepressant versus placebo benefit increased linearly from 5% in mild depression to 12% in moderate depression to 16% in severe depression. The claim that antidepressants are completely ineffective, or even harmful, in maintenance treatment studies reflects unawareness of the enriched design effect, which, in that analysis, favored the placebo arm. The same problem affects the standard interpretation of those studies: they do not prove antidepressant efficacy either, since they are biased in favor of antidepressants.
Conclusions
In sum, we conclude that antidepressants are effective in acute depressive episodes that are moderate to severe but are not effective in mild depression; correction for the statistical floor effect shows acute efficacy in all but the mildest depressive episodes. These considerations apply only to acute depression, however. For maintenance, the long-term efficacy of antidepressants is unproven, but the data do not support the conclusion that they are harmful.
Introduction
Much controversy has surrounded recent meta-analyses and randomized clinical trials (RCTs) of antidepressant efficacy in major depressive disorder (MDD), including in the nonscientific media. In this review, we use the concept of effect sizes to make clinical and scientific sense of what has become a cultural debate.
Examined here are the most prominent RCTs or meta-analyses of RCTs published in the last 5 years for both acute and maintenance efficacy of antidepressants in MDD. A summary of the review of these studies is provided in Table I.
Table I. Summary of analysis of reviews of antidepressant efficacy in RCTs of MDD.
Study | N | Trials Reviewed | Effect Sizes (95% CI) | Comments |
---|---|---|---|---|
Rush et al 1 STAR*D RCT | 3671 | 1 | 67% acute remission, 26% maintenance remission | No pbo group. Good acute efficacy is shown, but maintenance efficacy is about one half less than acute efficacy. |
Kocsis et al 23 and Kornstein et al5 Maintenance RCT of venlafaxine vs pbo | First maintenance study (year 0): n = 1096; second maintenance study (year 1): n = 114 | 2 | 92% 2-year efficacy reported; this reflects 11% of original sample | “Super-enrichment” design; second maintenance study sample was only ∼10% of the initial sample |
Turner et al 7 MA of FDA database of RCTs | 12,564 | 74 | 0.37 (0.33, 0.41) for published studies vs 0.15 (0.08, 0.22) for unpublished studies. ES of 0.31 (0.27, 0.35) when all studies are combined. | 31% of studies were unpublished, accounting for 27.5% of the sample |
Kirsch et al 2 MA of FDA database | 5133 | 35 | Overall standardized ES was 0.61. Absolute HDRS change was 9.6 for drug and 7.8 for pbo. | NICE criterion for clinical significance was an absolute ES of 3 HDRS points or a standardized ES of d = 0.5 for the AD-pbo difference. Overall nonstandardized effect size of 0.32 increases to 0.40 when corrected for baseline severity (authors do not discuss) |
Horder et al 8 Reanalysis of Kirsch et al 2 | 5133 | 35 | Absolute HDRS difference between AD and pbo = 2.70 (including negative unpublished studies) | Reanalysis was based on (1) a random effects rather than a fixed effects model as in Kirsch et al and (2) pooling ES differences study by study rather than pooling each arm across all studies and then taking the difference. These changes produce a much larger ES, near the NICE threshold. |
Davis et al 14 Narrative summary of MAs and RCTs | Not reported | Not reported | Mean acute difference between AD and pbo = 23.6%; mean maintenance difference between AD and pbo = 36% | Uncritical about bias toward ADs in maintenance studies using the enriched design |
Fountoulakis and Möller 13 Reanalysis of Kirsch et al 2 | 5133 | 35 | Mean AD ES was 10.05, not 9.60, as in Kirsch et al. AD-pbo difference was 2.18, not 1.80 as in Kirsch et al. Venlafaxine and paroxetine absolute HDRS ES were 3.12 and 3.22, respectively, exceeding NICE threshold. Nefazodone and fluoxetine did not. | Reanalysis was based on weighting the mean difference by sample size. |
Andrews et al 6 MA of maintenance RCTs | 3454 | 46 | Risk difference AD-pbo for relapse = 0.20, meaning a 20% higher relapse rate with AD than with pbo. | MA used an “enriched” design in favor of the pbo arm. |
Briscoe and El-Mallakh 22 Reanalysis of maintenance RCTs | 449 | 5 | 5 RCTs examined for AD efficacy after 6 mo. Four of 5 studies showed no benefit with AD over pbo. | Only analysis to correct for enriched design, which is biased in favor of ADs. Removes relapses due to AD withdrawal. |
Vöhringer and Ghaemi (present study) Reanalysis of Kirsch et al 2 MA to correct for statistical floor effect | 5133 | 35 | Relative effect size was 5% for mild depression (HDRS < 24), 12% for moderate (HDRS 24–28), and 16% for severe depression (HDRS > 28). | NICE criterion is met by an 11.7% relative difference between AD and pbo. This analysis disproves the claim by Kirsch et al that only severe depression has a clinically meaningful ES; moderate depression also met the NICE criterion. |
AD = antidepressant; CI = confidence interval; ES = effect size; FDA = US Food and Drug Administration; HDRS = Hamilton Depression Rating Scale; MA = meta-analysis; MDD = major depressive disorder; NICE = National Institute for Health and Clinical Excellence (UK); pbo = placebo; RCT = randomized clinical trial.
In acute depression RCTs, some reviews involve reanalysis of the US Food and Drug Administration (FDA) database of RCTs conducted by pharmaceutical companies. The major nonpharmaceutical industry study is the National Institute of Mental Health (NIMH)–sponsored Sequenced Treatment Alternatives to Relieve Depression (STAR*D) project.1
The pharmaceutical trials have been analyzed and reanalyzed by different authors, with the most media attention being given to the analysis by Kirsch et al.2 Other published analyses are also important.3
Maintenance RCTs for prevention of depressive episodes have been analyzed in the Cochrane database4; most of these studies were conducted by pharmaceutical companies. The most prominent and highly marketed and cited recent study of the topic was a 2-year RCT of the antidepressant venlafaxine.5
A recent reanalysis of the maintenance RCT studies has also examined the impact of antidepressant discontinuation, concluding that antidepressant use may cause long-term biological harm.6
The STAR*D study also provides data for analysis regarding maintenance prevention of depressive episodes in MDD.1
Patients and Methods
We analyzed recent prominent RCTs and meta-analyses that addressed antidepressant efficacy in MDD. We examined how assessment of effect sizes could clarify the controversies surrounding acute and maintenance efficacy of antidepressants in MDD. Effect estimates given by these studies are reported, along with their 95% CIs when available.
Results
Eleven prominent RCTs or meta-analyses of RCTs (2006–2011) are summarized in Table I. Each study is broken down in terms of the main aspects of its study design, clinical characteristics, and outcomes. Later, those results are described in more detail in 2 sections—acute and maintenance studies—and are interpreted using effect size concepts. In Table II, we report our reanalysis of the results of a prominent meta-analysis2 to correct for a statistical floor effect in mild depression. In doing so, we discovered that the claim that antidepressants are effective only in severe depression, not in moderate or mild depression, is wrong. They are also effective in moderate depression, as explained later.

Table II. Relative effect size difference (drug/placebo) by depression severity in Kirsch et al's meta-analysis (n trials = 35).
Depression Severity (% Studies in Database) | Drug: Mean Baseline HDRS Score | Drug: Mean Final Change in HDRS Score | Drug: Relative Effect Size (%)† | Placebo: Mean Baseline HDRS Score | Placebo: Mean Final Change in HDRS Score | Placebo: Relative Effect Size (%)† | Relative Effect Size Difference (%) (Drug-Placebo)
---|---|---|---|---|---|---|---|
Mild (23%) | 22.6 | 8.8 | 39 | 23.2 | 8.0 | 34 | 5
Moderate (54%) | 25.6 | 10.5 | 41 | 25.4 | 7.4 | 29 | 12
Severe (23%) | 28.75 | 12.0 | 42 | 28.2 | 7.2 | 26 | 16
HDRS = Hamilton Depression Rating Scale.
† Relative effect size = absolute mean HDRS change/mean baseline HDRS score.
Mild = at least 1 arm (drug or placebo) rated <24 on HDRS; moderate = at least 1 arm rated 24 to 28 on HDRS; severe = at least 1 arm rated >28 on HDRS.
Discussion
Acute Depression
Analyses of the FDA Database
The pharmaceutical industry is obligated to submit all data, positive or negative, regarding studies of drugs that receive FDA approval. Through the Freedom of Information Act, scholars have begun to get access to these FDA records. Previous systematic reviews of such studies of antidepressants in MDD have shown that many studies with negative results have gone unpublished. Turner et al showed that approximately 94% of the published literature on antidepressants in MDD demonstrates efficacy (positive studies), but when the unpublished FDA database is included, only 51% of all such studies (published and unpublished) show positive results. The standardized effect size fell from about 0.37 to 0.31 after including the negative unpublished studies, both effects being in the mild range.7
The same year as the above analysis, another was published by Kirsch et al2 with a smaller sample of the FDA database (less than half the size of the analysis by Turner's group). It confirmed an unstandardized effect size of 0.32, similar to that of the previous analysis by Turner et al. The key difference was that Kirsch et al's meta-analysis2 focused on a clinical significance criterion set in the United Kingdom by the National Institute for Health and Clinical Excellence (NICE): a 3-point difference on the Hamilton Depression Rating Scale (HDRS) or a 0.5 standardized effect size difference. As shown in Table I, the results of this reanalysis fell short of those effect size cutoffs, except for severe depression.

In follow-up popularizations, the first author of that meta-analysis2 interpreted his analysis as indicating that, in general, antidepressants do not have clinically meaningful effects in MDD. In the scientific paper, the authors were more circumspect although still critical; they attributed antidepressant benefit to only “the most extremely depressed patients,”2 although an HDRS cutoff of 28 is not, in clinical practice, descriptive of the most extremely depressed patients. Many such patients have HDRS scores in the 30s or higher. In this meta-analysis, the drug-placebo difference varied based on severity of illness, approximating 0 at an HDRS score of 24 and reaching about 3 points at an HDRS score of 28. The authors noted that this effect was due to changes in the response to placebo, which fell with increasing severity, rather than the response to antidepressant, which was consistent. Although they noted this finding, the authors never grappled with its meaning. It would seem that mild depression is highly responsive to placebo but severe depression is not. The authors appear to conclude that antidepressants are not more effective in severe depression, but in fact they are. The loss of placebo “response” may not be the loss of a response to anything at all; placebo response reflects, in part if not in whole, the natural history of depressive episodes. Severe depression does not go away rapidly; if it is not treated, it remains. Antidepressants treat it and are effective. The authors do not see this because they have ignored the importance of the natural history of depressive episodes in assessing treatment effects.

Reanalysis of the FDA Meta-Analyses: Correction for a Floor Effect Disproves Claims of Antidepressant Inefficacy
A key statistical issue in comparisons of mild versus more severe depression, when using absolute effect sizes, is a floor effect. With a lower baseline HDRS score, the same drug-placebo effect (eg, 50% decrease in scores) produces smaller absolute differences (eg, 20 to 10 HDRS points—a 10-point difference) compared with a higher baseline HDRS score (30 to 15 HDRS points—a 15-point difference). In this meta-analysis, the drug-placebo difference, when adjusted for baseline severity of illness, increased in nonstandardized effect size from 0.32 to 0.40. In other words, some of the apparent lack of benefit of antidepressants in milder depression may be an artifact of this floor effect. Kirsch et al reported this result in a table but did not comment on it.2
Another way to address this problem is to report the relative (not absolute) drug-placebo difference, dividing absolute change by baseline severity of depression. This was not reported in Kirsch et al's analysis. For the first time, we provide such an analysis in this article.
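As a worked example, this relative-measure correction can be reproduced directly from the group means reported in Table II. This is a minimal sketch; the function and variable names are ours, not from the original analyses, and percentages are rounded to whole numbers as in the table.

```python
# Relative effect size = mean HDRS change / mean baseline HDRS score,
# computed per arm; the drug-placebo difference of these percentages
# corrects for the statistical floor effect. Group means are those
# reported in Table II.

def relative_effect(baseline_hdrs, change_hdrs):
    """Relative effect size, as a whole percentage of baseline severity."""
    return round(100 * change_hdrs / baseline_hdrs)

# severity: (drug baseline, drug change, placebo baseline, placebo change)
groups = {
    "mild":     (22.60,  8.8, 23.2, 8.0),   # baseline HDRS < 24
    "moderate": (25.60, 10.5, 25.4, 7.4),   # baseline HDRS 24-28
    "severe":   (28.75, 12.0, 28.2, 7.2),   # baseline HDRS > 28
}

results = {}
for severity, (db, dc, pb, pc) in groups.items():
    drug_pct = relative_effect(db, dc)
    pbo_pct = relative_effect(pb, pc)
    results[severity] = drug_pct - pbo_pct
    print(f"{severity}: drug {drug_pct}%, placebo {pbo_pct}%, "
          f"difference {drug_pct - pbo_pct}%")

# The NICE absolute criterion (a >=3-point HDRS difference), applied at the
# database's weighted mean baseline HDRS of 25.5, translates into a relative
# drug-placebo difference of roughly 11.7%:
nice_relative = 100 * 3 / 25.5
```

Under this translation of the NICE criterion, the moderate (12%) and severe (16%) groups exceed the roughly 11.7% relative threshold, whereas the mild group (5%) does not.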
Table II shows the percentage differences in drug effect, with the absolute change in each group divided by the baseline HDRS score. Using this relative effect measure, antidepressants were somewhat less effective in milder depression (baseline HDRS <24) than in severe depression (baseline HDRS >28); the relative antidepressant versus placebo benefit increased linearly from 5% in mild depression to 12% in moderate depression to 16% in severe depression. The studies used in the meta-analysis2 had a weighted mean baseline HDRS score of 25.5.8 Using that baseline and absolute improvement rates near those reported in the study (9.6 for drug, 7.8 for placebo) but widened to meet the NICE criterion of a ≥3-point difference (ie, 10 for drug vs 7 for placebo), we can calculate that the NICE criterion would have been met with relative drug improvement of 39.2% (10/25.5) versus relative placebo improvement of 27.5% (7/25.5), for a drug-placebo relative difference of 11.7%. With this definition of the NICE criterion, antidepressants still do not meet that definition in mild depression (HDRS <24), but they do meet it for both moderate (HDRS 24–28) and severe (HDRS >28) depression.

In this reanalysis, we used the same severity cutoffs as the authors of the meta-analysis: HDRS scores <24, 24 to 28, and >28. We labeled these 3 groups as mild, moderate, and severe, respectively. Despite analyzing their data in these 3 groupings, the authors of the meta-analysis claimed they had used the American Psychiatric Association's criteria for severity of symptoms (based on HDRS scores): mild (HDRS = 8–13), moderate (HDRS = 14–18), severe (HDRS = 19–22), and very severe (HDRS >22).9 In so doing, they ignored the fact that symptoms differ from episodes: the typical major depressive episode (MDE) produces HDRS scores of at least 18. Thus, by using symptom criteria, all MDEs are by definition severe or very severe. Clinicians know that some patients meet MDE criteria and are still able to work; indeed, those around them frequently do not even recognize that such a person is clinically depressed. Other patients are so severely depressed that they function poorly at work, and their companions recognize that something is wrong. Some clinically depressed patients cannot work at all, and still others cannot get out of bed for weeks or months on end. Clearly, there are gradations of severity within MDEs, and the entire debate in the meta-analysis discussed here is about MDEs, not depressive symptoms, since all patients had to meet MDE criteria in all the studies included in the meta-analysis (conducted by pharmaceutical companies for FDA approval for treatment of MDEs).

The question, therefore, is not about the severity of depressive symptoms but the severity of depressive episodes, assuming that someone meets Diagnostic and Statistical Manual of Mental Disorders (Fourth Edition) (DSM-IV) criteria for an MDE. On that question, a number of prior studies have examined the matter with the HDRS and with other depression rating scales, and the 3 groupings shown in Table II correspond rather closely to validated and replicated definitions of mild (HDRS <24), moderate (HDRS 24–28), and severe (HDRS >28) MDEs.10, 11, 12
In other words, if one corrects for the statistical floor effect (which was also shown in the data reported by the authors2 in a regression model correcting for baseline severity of illness), then the claim that antidepressants are effective only in the most extreme depressive conditions is disproven. Antidepressants are effective in moderate as well as severe depression.

In sum, one can revise the conclusions of Kirsch et al after considering the analysis presented here. Instead of antidepressants being generally ineffective except in “the most severely depressed patients,”2 the reality is that antidepressants are generally effective except in the mildest depressive episodes.

Other Reanalyses of the FDA Meta-Analyses: Correction of Pooling Methods Increases Effect Size to Clinical Significance
Horder et al8 also reanalyzed the dataset in the above meta-analysis2 and noted 2 errors in the calculation of pooled effect size differences. In the original meta-analysis, the authors pooled all the antidepressant effect sizes (drug effect pre- and posttreatment) and then pooled all the placebo effect sizes (pre- and posttreatment). They then subtracted these 2 pooled effect sizes. This is statistically incorrect. Pooled differences should be assessed within each study to maximally incorporate the benefits of randomization within each study. Thus, for a first study, the difference between drug and placebo should be calculated; for a second study, the same difference should be calculated, and so on. The pooled effect size for the meta-analysis should be the sum of each study's drug-placebo effect size difference, divided by the number of studies. Horder et al corrected the calculation using this approach to pooling effect size differences. They also used the absolute effect size difference on the HDRS, since all the studies used the same scale. They correctly noted that there is no need to use a standardized effect size measure (eg, Cohen's d) when all studies use the same outcome (HDRS); standardized effect sizes are used in an attempt to equalize different outcomes (eg, HDRS compared with different depression rating scales). The mathematical manipulations introduced by standardizing may alter one's results somewhat, making them both less interpretable and less valid.

Finally, Horder et al used the most valid measure of meta-analytic effect—the random effects model, as opposed to the fixed effects model of the original review.2 Fixed effects models assume that all studies have similar variabilities; when one is comparing studies of different drugs in patient populations that vary in severity of illness, which the authors showed was an important predictor of response, the fixed effects assumption is not valid. The random effects assumption includes the idea that studies differ from one another in important respects. Fixed effects models correct only for sample size and assume no other kinds of error, whereas random effects models introduce a second correction for presumed error.

When making these 3 corrections—(1) pooling drug-placebo differences study by study, (2) using the absolute HDRS effect size difference only, and (3) using a random effects model for the meta-analytic summary—Horder et al found a much higher effect size (HDRS difference of 2.70, quite near the NICE cutoff of 3), as opposed to the clearly low HDRS difference effect size of 1.80 in the original meta-analysis.
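The difference between the two pooling procedures can be shown with a toy calculation. The numbers below are invented for illustration (they are not data from the meta-analysis), and weighting by sample size is our simplifying assumption; the point is only that pooling each arm across studies can diverge from pooling within-study differences whenever arm sizes vary across studies.

```python
# Two hypothetical trials with unequal arm sizes. Each entry gives the
# number of patients and the mean HDRS improvement per arm.
studies = [
    {"drug_n": 100, "drug_change": 10.0, "pbo_n": 50,  "pbo_change": 8.0},
    {"drug_n": 50,  "drug_change": 6.0,  "pbo_n": 100, "pbo_change": 5.0},
]

def pooled_arms_difference(studies):
    """Pool all drug arms, pool all placebo arms, then subtract the two
    pooled means (the procedure criticized by Horder et al)."""
    drug_mean = (sum(s["drug_n"] * s["drug_change"] for s in studies)
                 / sum(s["drug_n"] for s in studies))
    pbo_mean = (sum(s["pbo_n"] * s["pbo_change"] for s in studies)
                / sum(s["pbo_n"] for s in studies))
    return drug_mean - pbo_mean

def within_study_difference(studies):
    """Take the drug-placebo difference within each study first, then
    average the differences (here weighted by study size), preserving
    each study's randomization."""
    weighted = sum(
        (s["drug_n"] + s["pbo_n"]) * (s["drug_change"] - s["pbo_change"])
        for s in studies
    )
    total_n = sum(s["drug_n"] + s["pbo_n"] for s in studies)
    return weighted / total_n

naive = pooled_arms_difference(studies)     # ~2.67 HDRS points
correct = within_study_difference(studies)  # 1.50 HDRS points
```

In this contrived example the arm-pooling procedure nearly doubles the apparent drug-placebo difference, because the larger drug arm of the first study and the larger placebo arm of the second study dominate their respective pools.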
Other Reanalyses
Fountoulakis and Möller13 also reanalyzed Kirsch et al's meta-analysis.2 They made one correction, weighting the mean difference in each study for sample size. In so doing, they found a slightly larger effect size of 2.18, but not one large enough to meet the NICE criterion. They also reported that, when examined by drug type, venlafaxine and paroxetine met the NICE criterion of a 3-point drug-placebo difference, but nefazodone and fluoxetine did not. We would add that the nefazodone studies all involved mild depression (no baseline HDRS >25), and thus lack of benefit may reflect the mildness of the depression per se (whose natural history leads to rapid recovery, as discussed below) rather than inefficacy of the drug itself.

Other discussions of these meta-analyses have included a commentary by Ioannidis,3 who, like the group critical of antidepressants,2 concluded that these agents are largely ineffective. Ioannidis added a quantitative simulated analysis of a situation in which, if one assumes that the true effect size is small (eg, 0.20), then, with moderate or larger variability (due to small sample sizes), reported effect sizes would always be larger than the real effect size of 0.20. In other words, most reported effect sizes are probably inflated estimates of the real effect sizes, especially if studies are not large. Thus, if the debate is over whether 2.70 is close enough to the NICE threshold of 3 points, compared with 1.80, Ioannidis suggested that we should adjust conceptually for somewhat lower effect sizes than these exact numbers.

His review has been challenged by Davis et al.14 They emphasized that however one analyzes the antidepressant literature, the effect size of benefit with antidepressants over placebo is not 0. An effect size of 0.31 is small, but it still represents an effect in some people. They pointed out that oncology studies, for instance, support the use of treatments with much smaller effect sizes because the conditions are otherwise terminal. They emphasized that because of the notable morbidity and mortality of severe depression, at least, any drug benefit is valuable. They based their conclusions on a narrative review of prior meta-analyses and major RCTs; their discussion of maintenance efficacy studies was uncritical, as will be discussed later.

STAR*D
The sometimes rancorous debates about pharmaceutical industry studies are limited by the fact that they are pharmaceutical industry studies. They were conducted for FDA registration and to market drugs for profit, not to learn the truth in any economically disinterested fashion. This is why the huge NIMH-sponsored, double-blind, randomized STAR*D study is of major importance in addressing the question of antidepressant efficacy. It was conducted by academic sites that carefully organized and conducted their studies to meet NIMH standards, not by for-profit research groups that tried to meet pharmaceutical industry standards. The latter setting often involves paying patients to participate, sometimes at rather high rates, and there are well-known concerns about the misrepresentation of data to meet recruitment goals. Further, the FDA database involves pooling many different studies with different drugs in different study subjects, sometimes in different countries. The heterogeneity introduced by such differences is the bane of such large meta-analyses. Such heterogeneity is a type of confounding bias, making the results of these huge meta-analyses somewhat doubtful, since the pooled results of studies are not randomized. Only the data within each study are randomized and thus free of confounding bias.
This heterogeneity of data is not a minor issue, but it is one that many of the debaters ignore. A meta-analysis can never be taken at face value, because it is not randomized; meta-analyses are always observational and thus biased to a greater or lesser degree. All things being equal, a large single RCT is more valid than a meta-analysis, because the former is randomized and the latter is not.
Thus, a single huge RCT, such as STAR*D, is more valid, based on confounding bias concerns, than the huge FDA meta-analyses of multiple RCTs. The main limitation of STAR*D is the absence of placebo controls, which means it cannot be used to determine definitively whether antidepressants work better than placebo. However, if antidepressants were nothing but placebos, we would legitimately expect rather low response rates in STAR*D, especially in severe depression.
The main purpose of STAR*D was to learn which antidepressant treatments were effective in those who failed to remit initially with a single antidepressant trial. In stage 1, the antidepressant chosen was citalopram, a typical serotonin reuptake inhibitor, and it was given open label initially to identify nonresponders who were then randomized to various steps of other treatments. Perhaps not too surprisingly, initial response to citalopram was approximately 50% and initial remission about 30%.15
The remaining subjects, who were all nonresponders to stage 1, were then randomized to 3 sequential stages of treatment. They continued down the tree of options if they failed to remit in any phase, for as long as they were willing to stay in the randomized studies. In the second stage of treatment (either switching to a different antidepressant or augmenting with one), a similar rate of acute response was seen (about 50%). However, by stages 3 and 4, despite using agents previously shown to be most effective (eg, tricyclic antidepressants and monoamine oxidase inhibitors or lithium augmentation), acute response rates ranged around 20%. Further, by stages 3 and 4, remission and response rates were about the same (ie, a better response was not seen with a more liberal definition of improvement than that used for remission). As the authors of STAR*D comment, these results can be read as good news in the sense that one can conclude, with multiple phases of treatment, that about 60% or so of patients will respond acutely (>50% improvement in depressive symptoms).1 This seems much higher than one would expect from natural history.

It should be noted that after the initial citalopram treatment phase, STAR*D was a double-blind, randomized study (though without a placebo arm). All stages from 2 onward involved randomized, not observational, data, and the results are as valid as those of any standard randomized clinical trial.
The results are not definitive, however, given the FDA database analyses, since the mean initial HDRS score was 21.8 in STAR*D, consistent with mild depression. In that group, one would expect much spontaneous recovery because of natural history or the nonspecific benefits of a placebo response. This possibility cannot be ruled out.
Maintenance Efficacy of Antidepressants in Major Depressive Disorders
Biases of the Enriched Design for Maintenance Efficacy
Before examining analyses of maintenance studies in MDD, it is useful to understand how such studies are designed, in order to appreciate why they are mostly biased in favor of antidepressants.
Most maintenance studies of antidepressants begin, before randomization, with patients who have an acute MDE and are treated with the antidepressant being studied. Patients who respond to the antidepressant are entered into the maintenance study, but those who do not respond to or do not tolerate the antidepressant are excluded. Thus, the study is already biased in favor of the antidepressant. Patients are then followed for 1 to 2 years; the majority of relapses, however, occur in the first 6 months of follow-up. This design does not prove maintenance efficacy, because the maintenance phase of treatment in MDD does not begin until 1 year after the acute episode ends, which is when the natural remission of an acute depressive episode occurs.16, 17, 18 Thus, in the depression literature, there is a clear consensus that 1 year or longer is the relevant time frame to assess the prevention of new episodes. Even if not everyone agrees on the 1-year period, it would be reasonable to say that at least 6 months after the acute episode is needed to assess maintenance efficacy. Most maintenance RCTs fail to pass this simple test.

This problem has been much discussed in the bipolar disorder literature,18 and we have related it previously to the maintenance studies of neuroleptics in bipolar disorder.19
The early literature on lithium included both prophylaxis and relapse prevention methodologies. In the prophylaxis design, “all comers” were included in the study. Any patient who was euthymic, no matter how that person got well, was eligible to be randomized to drug versus placebo or control, including those with recent manic or depressive episodes. In the relapse prevention design, typically only patients who responded acutely to the drug being studied were eligible to enter the randomized maintenance phase. Those who responded to the drug were then randomized to stay on the drug or be withdrawn from it (usually abruptly, sometimes with a taper) and switched to placebo.

The prophylactic and relapse prevention designs obviously do not address the same questions about drug efficacy. In the lithium studies in which the relapse prevention design was used (ie, only initial lithium responders to acute treatment were included), there was evidence of lithium withdrawal in the placebo group following acute treatment.20, 21 By design, those who reach the maintenance phase and are treated with placebo are in fact persons who responded acutely to the study drug (lithium) and were then abruptly discontinued. Thus, if the placebo relapse rate is very high and almost exclusively limited to the first 1 to 2 months after study initiation, then one is observing a withdrawal effect involving relapse back into the same acute episode that had just been treated, rather than a new episode. The relapse prevention design thus confounds prevention of relapse back into the index episode with prevention of a new episode.

Besides the problem of withdrawal relapse, a key aspect of the relapse prevention design is that it is definitely biased in comparisons with active controls, and it is very likely biased against placebo as well. Although such studies are randomized, subjects are randomized only after being preselected as responsive to 1 of the 2 arms of the study. Thus, randomization is, in effect, instituted after the study has already been biased in favor of 1 of the 2 treatments. To put it simply, if some people like chocolate ice cream and others like vanilla, and we preselect only those who like chocolate ice cream to be randomized again to receive chocolate ice cream or vanilla ice cream, we will find that most chocolate ice cream lovers continue to prefer chocolate ice cream. This does not prove that chocolate ice cream is superior to vanilla ice cream.
The same principle applies to studies in which patients are preselected to respond to the study drug and later randomized to stay on the study drug or receive placebo. Again, the study would be biased in favor of the study drug and would not prove the inherent superiority of the study drug over placebo. A truly randomized study would have to either preselect subjects to be responsive to both treatments being studied or, as in the traditional prophylaxis study, make no preselection at all.
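The size of this bias can be illustrated with a small arithmetic sketch. All probabilities below are hypothetical values chosen for illustration, not estimates from any trial; they encode only the two assumptions discussed above, namely that some patients are inherently drug responsive and that abrupt withdrawal adds relapse risk to the placebo arm.

```python
# Assumed (hypothetical) relapse probabilities by patient type:
P_RESPONDER = 0.6             # fraction of patients who respond to the drug
RELAPSE_RESP_DRUG = 0.2       # drug responder who stays on the drug
RELAPSE_RESP_WITHDRAWN = 0.6  # drug responder abruptly switched to placebo
RELAPSE_RESP_PBO = 0.5        # drug responder on placebo, never drug-exposed
RELAPSE_NONRESP = 0.7         # nonresponder, on either drug or placebo

def enriched_design():
    """Relapse prevention design: only drug responders are randomized,
    and the placebo arm carries withdrawal relapse risk."""
    return RELAPSE_RESP_WITHDRAWN - RELAPSE_RESP_DRUG

def prophylaxis_design():
    """Unenriched design: all comers are randomized, with no
    withdrawal effect in the placebo arm."""
    drug_arm = (P_RESPONDER * RELAPSE_RESP_DRUG
                + (1 - P_RESPONDER) * RELAPSE_NONRESP)
    pbo_arm = (P_RESPONDER * RELAPSE_RESP_PBO
               + (1 - P_RESPONDER) * RELAPSE_NONRESP)
    return pbo_arm - drug_arm

enriched_benefit = enriched_design()      # 0.40: inflated by enrichment
unenriched_benefit = prophylaxis_design() # 0.18: the prophylactic effect
```

Under these assumed numbers, the enriched design more than doubles the apparent drug-placebo difference, even though part of the placebo arm's relapse rate is a withdrawal artifact rather than a loss of prophylaxis.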
These inherent biases of the enriched maintenance design are key to analyzing meta-analyses of the maintenance antidepressant efficacy literature. None of those reviews, save one, addresses the relevance of the enriched design, and thus they draw incorrect conclusions, both for and against antidepressants.
Maintenance Randomized Clinical Trials
The standard review of the maintenance efficacy of antidepressants often involves reference to the Cochrane collaboration meta-analysis of published studies. In that report, 10 studies of serotonin reuptake inhibitors (n = 2080) and 15 of tricyclic antidepressants (n = 881), mostly with 1-year follow-up, showed maintenance benefit versus placebo.
4
The longest follow-up with modern antidepressants was 2 years with venlafaxine.
4
An obvious problem with simply stating the results this way is that this meta-analysis does not address the issue of publication bias. If the acute antidepressant studies are any indicator, it is likely that some negative results from maintenance studies with antidepressants in MDD exist but are unpublished, and they would reduce this reported effect size.

A more important issue is the problem of enriched maintenance designs, which bias studies in favor of drug enrichment (or placebo, if analyses are enriched in the opposite direction, as discussed later). The only analysis of RCTs of antidepressants in MDD that has addressed the problem of enrichment is a recent paper by Briscoe and El-Mallakh.
22
They address the problem of enrichment by limiting data analysis to ≥6 months after the acute depressive episode. By so doing, they exclude those who relapsed soon after the maintenance study started, right after the end of the acute episode. Those who received antidepressant and were switched to placebo would relapse rapidly in the first few months of maintenance treatment. This discontinuation effect is an artifact of the enriched design and would not, in this view, demonstrate true recurrence of a new episode but rather immediate relapse into the same episode that had been present in prior weeks. Only 5 RCTs provided data on relapse rates before and after 6 months. Limiting analyses to those studies, the researchers found that, consistent with the biases of the enriched design, the majority of relapses (about two thirds) occurred in the first 6 months of follow-up. These were not new episodes of depression but withdrawal relapse into the same acute episode that had just occurred a few weeks or months earlier, before the maintenance study began. Among the one third of relapses occurring after 6 months, which test whether new episodes were truly being prevented, 4 of 5 studies found no benefit with antidepressants over placebo.

The Venlafaxine PREVENT Maintenance Study
Many authors cite a recent, long, large study of venlafaxine as evidence for antidepressant maintenance efficacy in MDD.
5
This study purports to show major benefits with venlafaxine for maintenance treatment of MDD, but it really reflects what we might call super-enrichment. The study repeatedly picks out those who respond to venlafaxine and re-randomizes them to venlafaxine or placebo, thus repeatedly selecting a smaller and smaller group of highly venlafaxine-responsive patients. By 2 years, this small group is indeed very responsive to venlafaxine, but the findings from this group are hardly generalizable to a new patient who might be prescribed venlafaxine.

The specific data are as follows: In that study, 1096 MDD patients initially received venlafaxine or fluoxetine for acute depression. A total of 715 responders were enrolled in 6-month blind continuation on the same treatment. After 6 months, 258 (35.9%, 258/715) of those acute responders remained well and entered maintenance phase A for 1-year treatment (randomized to venlafaxine vs placebo).
23
After 1 year in maintenance phase A, 131 responders (83 venlafaxine, 48 placebo) entered phase B for a second year of maintenance (venlafaxine responders were re-randomized to venlafaxine versus placebo; placebo responders stayed on placebo, and fluoxetine responders stayed on fluoxetine).

In the first year of maintenance treatment for the 258 responders, 23% of the venlafaxine-treated patients relapsed versus 42% of those receiving placebo. Thus, 77% of the venlafaxine group (n = 83) stayed well for 1 year after already preselecting those who had stayed well for 6 months (n = 258), who were selected after initially responding to treatment for an acute episode (n = 715), as described in the previous paragraph. This is only 11.6% (83/715) of initial sustained responders.
Only 12.5% of placebo responders at 1 year relapsed at 2 years, but among re-randomized venlafaxine responders (another super-enrichment on top of all the prior enriched selection phases), 44.8% of the placebo group relapsed at 2 years versus 8.0% on venlafaxine. Or, as the pharmaceutical industry marketing emphasized, 92% of venlafaxine patients remained well at 2 years' follow-up. This 92% seems like a huge number, but because of super-enrichment it represents the repeated selection of a tiny group of patients who were highly responsive to venlafaxine. It is 92% of the 11.6% mentioned earlier (those who responded at 1 year), which is 10.7% of the original sustained responders. Once dropouts are included, the numbers of patients still in treatment at 2 years, from the initial sample of >1000 patients, were 15 receiving placebo and 31 receiving venlafaxine (4.2% of the original sample).
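The cascading percentages above are easy to lose track of; this sketch simply recomputes the fractions from the patient counts reported in the text, to make the effect of super-enrichment explicit.

```python
# Sketch of the PREVENT study's "super-enrichment" cascade, using the
# patient counts reported in the text.
initial = 1096            # acutely treated with venlafaxine or fluoxetine
acute_responders = 715    # entered 6-month blind continuation
venlafaxine_phase_a = 83  # venlafaxine arm remaining after phase A (1 year)

# Fraction of initial sustained responders represented by that arm
frac_phase_a = venlafaxine_phase_a / acute_responders
print(f"{frac_phase_a:.1%}")  # 11.6%

# 92% of venlafaxine patients stayed well at 2 years, but that is 92% of
# an already small, repeatedly selected group:
frac_well_2yr = 0.92 * frac_phase_a
print(f"{frac_well_2yr:.1%}")  # 10.7%

# Including dropouts, 15 placebo + 31 venlafaxine patients remained at 2 years
remaining_2yr = 15 + 31
print(f"{remaining_2yr / initial:.1%}")  # 4.2% of the original sample
```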
Antidepressant Discontinuation Meta-Analysis
The most recent review of the maintenance MDD literature represents a unique analysis.
6
Andrews et al essentially conducted an enriched study of placebo response, that is, they selected the data for analysis based on a sample enriched for placebo responders and biased against those who responded to drugs. They then concluded that drugs were ineffective and even harmful. All they really proved—once again—is that the enriched maintenance design is biased against whatever one wants to bias it against.

This analysis is the converse of the standard enriched design maintenance study, as described previously, which is enriched for drug response and biased against placebo response. The same limitations apply in both cases: enrichment does not prove the inefficacy or harm of the treatment that is not being enriched, nor does it prove the efficacy or benefit of the treatment that is being enriched.
In this review, Andrews et al
6
collected 7 studies of maintenance treatment with antidepressants versus placebo in which initial acute treatment was provided with the 2 arms; in these 7 studies, the maintenance phase involved continuation of those patients who had responded to placebo acutely. In those acute placebo responders, relapse in the maintenance phase was (not surprisingly) uncommon (24.7%). In contrast, 39 trials involved acute treatment with antidepressant versus placebo, in which the reviewers selected patients who responded to antidepressants acutely and then were randomized to receive placebo in maintenance treatment. In this group, which reflected discontinuation of the antidepressant after acute response, the relapse rate was 42.1%.

The authors interpreted these results as indicating harm with the use of antidepressants—results that they speculatively relate to animal data on monoaminergic effects of these agents. They concluded that the biological effects of antidepressants actually increase the risk of relapse in long-term treatment, compared with the risk of no treatment (placebo). This interpretation ignores the problems of the enriched design, and, as a result, this kind of analysis highlights the importance of always comparing treatment results to what happens in the natural history of an illness.
This meta-analysis
6
enriches the results for placebo response. The patients treated acutely who responded to placebo stayed on placebo; the patients treated acutely who responded to antidepressants were taken off antidepressants. One should ask why these placebo responders responded to placebo. Did they actually respond to placebo, in the sense that the inert pill directly produced a response, or was placebo a stand-in for natural recovery (spontaneous remission), as part of the natural history of recurrent, episodic depression?

The last is a possibility for part, if not all, of the placebo “response.” More than a century of natural history research, especially before the treatment era in past decades, has established the fact that recurrent unipolar depression follows an episodic course, in which there are periods of acute symptoms and periods of natural remission.
14
, 15
, 16
During periods of natural remission, patients stay well, often for years, without any treatment. The recovery of some patients on placebo, in those 7 studies, may well reflect natural cycling out of acute episodes in unipolar depression. Once patients have cycled out of acute episodes, they are in natural remission, which, in the case of recurrent unipolar depression, usually involves >1 year of remission before the next depressive episode.
16
In the 7 placebo maintenance response studies, no study exceeded 12 months of follow-up; the appendix to the meta-analysis indicates that the mean duration of follow-up was <2 months in 6 of the 7 studies (range 1.4–1.9 months).

In other words, the lack of relapse really means that a patient improved spontaneously from acute depression in a 2-month study (the usual duration of acute depression studies) and then remained well for another 2 months. This is not robust evidence of long-term stability on placebo but rather an indication that when spontaneous remission occurs from acute depression, it lasts at least 2 months (and indeed usually up to 1 year) without any treatment.
In contrast, in the antidepressant discontinuation studies analyzed, all patients responded in the acute treatment phase (usually 2 months in duration), and then 42% relapsed during maintenance treatment after the antidepressant was discontinued. One might ask whether serotonin withdrawal syndrome, which can mimic depressive episodes, occurred in some cases. Aside from that issue, however, a century of natural history research has led to a clear consensus that the mean duration of a typical depressive episode in unipolar depression is 6 to 12 months.
14
, 15
, 16
If a patient is treated to recovery at 2 months and then the treatment is stopped, such a patient will relapse into the mood episode rapidly, because the 6- to 12-month period of the biological persistence of a mood episode has not yet elapsed. This finding has been reported repeatedly with antidepressants in depression and with neuroleptics in mania.
19
In sum, this creative analysis of the maintenance MDD literature suffers from a complete lack of awareness of the impact of the enriched design; the analysis is enriched for placebo response and thus biased against antidepressant effect. The most conceptually parsimonious and empirically well-supported interpretation of these findings, based on extensive clinical literature in human beings (as opposed to speculative biological extrapolations from animal studies), would be to view them as a result of the natural history of depression, not as a specific harm from antidepressants or a special benefit from placebo.
Maintenance Data in STAR*D
Although STAR*D is mainly reported in terms of acute data, it also provides maintenance data, which may be the best evidence to date on long-term efficacy with antidepressants in unipolar depression. Further, STAR*D was designed to be generalizable to the real world of complex, comorbid, recurrently depressed patients, as opposed to the cleaner populations studied in most RCTs (designed for FDA registration by the pharmaceutical industry).
As noted previously, STAR*D is a double-blind, randomized study; all the maintenance data after the first phase of treatment (ie, with the dozen or so antidepressant treatments given besides citalopram) involve randomized, not observational, data.
The basic results are as follows: Of subjects who responded or remitted acutely with antidepressants in STAR*D, only about one half stayed well at 1 year (sustained remission). In other words, by preselecting patients who have acute benefit with antidepressants, as noted earlier, one half will maintain benefit. Since 50% get acute benefit, and 50% of that group have sustained maintenance benefit, only 25% of the overall sample has long-term maintenance remission with antidepressants in unipolar depression.
1
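The compounding here is simple but easy to gloss over; this sketch restates it, using the approximate 50% rates given in the text.

```python
# STAR*D maintenance arithmetic as described in the text: roughly half of
# patients get acute benefit, and roughly half of those stay well at 1 year.
acute_benefit = 0.50
sustained_given_acute = 0.50

overall_sustained = acute_benefit * sustained_given_acute
print(f"{overall_sustained:.0%}")  # 25% of the overall sample
```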
Based on STAR*D findings, the long-term benefit with antidepressants in unipolar depression appears to be much less than has often been assumed.

Objections to Our Critique of Enriched Maintenance Designs
The previous critique of enriched maintenance designs is neither widely known nor generally accepted. It is novel, rarely stated, and—when stated—strongly opposed by many researchers involved with maintenance studies in psychopharmacology.
There has not been much published discussion of this topic, but one objection that could be raised is that the enriched design is not biased because those who respond acutely to a drug treatment are both “true drug responders” and “placebo drug responders,” meaning that some of them would have responded to placebo had they been given placebo. Thus, the design is not biased solely toward the study drug. This objection would make sense only if all patients were equally likely to respond to drugs or placebo; if 50% of patients “really” responded to drug (true drug response) and 50% would have responded to placebo had it been given (placebo drug response), then a maintenance randomization of those acutely responsive patients to drug versus placebo would be valid. Ironically, this would be the case only if the critique of Kirsch et al
2
is correct, that is, if antidepressants are not more effective than placebo for acute depression.

If antidepressants are more effective than placebo for acute depression in most patients, as we believe we showed earlier, then the percentage of true drug responders should be higher than the percentage of those who would have responded to placebo anyway (placebo drug responders). In a hypothetical group of acutely depressed patients treated with antidepressant X and later randomized to a maintenance study of X versus placebo, the reality is that there would not have been a 50–50 split between true drug responders and placebo drug responders before maintenance randomization. The split would be 60–40 or 70–30 or even higher in favor of drug X. In other words, because antidepressants are better than placebo acutely, enrichment for acute efficacy before maintenance RCTs is indeed biased in favor of antidepressants as opposed to later treatment with placebo. Enrichment entails bias.
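A toy model can make this bias concrete. All rates below are hypothetical assumptions chosen for illustration, not data from any study: suppose 50% of patients would stay well on the drug and 30% on placebo, independently, and that an enriched design randomizes only acute drug responders to maintenance.

```python
# Toy model of enrichment bias in maintenance designs. All rates are
# hypothetical assumptions for illustration, not data from any study.
p_drug, p_placebo = 0.50, 0.30  # assumed chance of staying well on each

# Unenriched ("all comers") design: randomize everyone to maintenance.
unenriched_gap = (1 - p_placebo) - (1 - p_drug)  # placebo relapse - drug relapse
print(round(unenriched_gap, 2))  # 0.2: the drug's true maintenance advantage

# Enriched design: only acute drug responders are randomized.
# Drug arm: all were preselected as drug responders, so none relapse.
# Placebo arm: only the (independently) placebo-responsive fraction stays well.
enriched_gap = (1 - p_placebo) - 0.0
print(round(enriched_gap, 2))  # 0.7: enrichment inflates the apparent gap
```

The independence assumption is a deliberate simplification; the qualitative point is that preselection guarantees a well-behaved drug arm while leaving the placebo arm fully exposed, so the enriched comparison overstates the drug's true maintenance advantage.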
Interestingly, many psychiatric researchers appear to understand this critique fully as applied to the maintenance meta-analysis by Andrews et al
6
; they appreciate that such an analysis entails “apples and oranges,” picking out placebo responders and comparing how they fared later when continued on placebo versus choosing drug responders and comparing how they later did when switched to placebo. Placebo responders are different from drug responders, it is said. We agree. All placebo responders, by definition, respond to placebo, whereas probably only some would respond to drugs. Thus, such analyses are biased in favor of placebo response.

Although this enriched method (a species of selection bias that is unique to maintenance clinical trial design
19
) is rejected by many in our field in relation to the claim that placebo is as good as or better than an antidepressant, the same method is used to assert that antidepressants are more effective than placebo. The reason for such selectivity about accepting or rejecting the same research methodology is not entirely clear.

Conclusions
Numerous reviews and meta-analyses of the antidepressant literature in MDD, both acute and maintenance, appear to make larger claims than their research methods allow. Specifically, based on the available FDA database analyses, it is false to claim that antidepressants are, in a general sense, ineffective in acute depressive episodes. The claim that they lack such benefits is disproved by standard valid methods of pooling effect size differences and by using appropriate meta-analytic models. Correction of those effect size differences for a floor effect, so that relative (instead of absolute) effect size differences are calculated, shows that antidepressant benefit is seen not only in severe depression, but also in moderate depression. These analyses confirm lack of benefit of antidepressants over placebo in mild depression. One can turn around the attention-getting conclusions of the review by Kirsch et al: Instead of concluding that antidepressants are ineffective acutely except for the most extreme depressive episodes, correction for the statistical floor effect proves that antidepressants are effective acutely except for the mildest depressive episodes. The claim that antidepressants are completely ineffective, or even harmful, in maintenance treatment studies involves an unawareness of the enriched design effect, which has been used to analyze placebo efficacy. The same problem exists for the standard interpretation of those studies, however; they do not prove antidepressant efficacy either, since they are biased in favor of antidepressants. In sum, in trying to make an objective and statistically valid assessment, we conclude that antidepressants are effective for acute depressive episodes that are moderate to severe but not mild.
For maintenance efficacy, the research designs used have been biased in their favor, and it would seem more objective to conclude that long-term antidepressant efficacy is not proved but neither is the conclusion that antidepressants are harmful.
Conflict of Interest Statement
In the past 12 months, Dr. S. Nassir Ghaemi has received a research grant from the NIMH and from Pfizer, Inc. He provided one-time research consultations to Pfizer, Inc. and Sunovion, Inc. Neither he nor his family hold equity positions in these or other companies. Dr. Paul A. Vohringer has no potential conflicts of interest to disclose.
Acknowledgments
This work was supported partly by grant 5R01MH078060 from the National Institute of Mental Health (S.N.G.) and a scholarship from the National Commission for Scientific and Technological Research (CONICYT) of the government of Chile (P.A.V.). The authors acknowledge the helpful input of Barney Carroll MD and Maurizio Fava MD for part of the manuscript. Both authors contributed equally to the conduct of the study and creation of the manuscript.
References
- Acute and longer-term outcomes in depressed outpatients requiring one or several treatment steps: a STAR*D report.Am J Psychiatry. 2006; 163: 1905-1917
- Initial severity and antidepressant benefits: a meta-analysis of data submitted to the Food and Drug Administration.PLoS Med. 2008; 5: e45
- Effectiveness of antidepressants: an evidence myth constructed from a thousand randomized trials?.Philos Ethics Humanit Med. 2008; 3: 14
- SSRIs versus other antidepressants for depressive disorder.Cochrane Database Syst Rev. 2006; (CD001851)
- Assessing the efficacy of 2 years of maintenance treatment with venlafaxine extended release 75-225 mg/day in patients with recurrent major depression: a secondary analysis of data from the PREVENT study.Int Clin Psychopharmacol. 2008; 23: 357-363
- Blue again: perturbational effects of antidepressants suggest monoaminergic homeostasis in major depression.Front Psychol. 2011; 2: 159
- Selective publication of antidepressant trials and its influence on apparent efficacy.N Engl J Med. 2008; 358: 252-260
- Placebo, Prozac and PLoS: significant lessons for psychopharmacology.J Psychopharmacol. 2010 Jun 22; ([Epub ahead of print])
- Rush A.J. First M.B. Blacker D. Handbook of Psychiatric Measures. American Psychiatric Press, Washington, DC; 2000
- Differential effects of venlafaxine in the treatment of major depressive disorder according to baseline severity.Eur Arch Psychiatry Clin Neurosci. 2009; 259: 329-339
- Is severe depression a separate indication?.Eur Neuropsychopharmacol. 1999; 9: 259-264
- The Carroll rating scale for depression.Br J Psychiatry. 1981; 138: 205-209
- Antidepressant drugs and the response in the placebo group: the real problem lies in our understanding of the issue.J Psychopharmacol. 2011 Sep 17; ([Epub ahead of print])
- Should we treat depression with drugs or psychological interventions?.Philos Ethics Humanit Med. 2011; 6: 8
- Evaluation of outcomes with citalopram for depression using measurement-based care in STAR*D: implications for clinical practice.Am J Psychiatry. 2006; 163: 28-40
- Manic-Depressive Insanity and Paranoia (Barclay RM, trans). In: Robertson G.M., ed. E & S Livingstone, Edinburgh, UK; 1921
- Three-year outcomes for maintenance therapies in recurrent depression.Arch Gen Psychiatry. 1990; 47: 1093-1099
- Manic Depressive Illness. 2nd ed. Oxford University Press, New York, NY; 2007
- Maintenance treatment study designs in bipolar disorder: do they demonstrate that atypical neuroleptics (antipsychotics) are mood stabilizers?.CNS Drugs. 2011; 25: 819-827
- Risk of recurrence following discontinuation of lithium treatment in bipolar disorder.Arch Gen Psychiatry. 1991; 48: 1082-1088
- Relapse into mania or depression following lithium discontinuation: a 7-year follow-up.Acta Psychiatr Scand. 2004; 109: 91-95
- The evidence for the long-term use of antidepressants as prophylaxis against future depressive episodes. Oral presentation at the American Psychiatric Association Annual Meeting, New Orleans, LA; May 22–26, 2010
- Prevention of recurrent episodes of depression with venlafaxine ER in a 1-year maintenance phase from the PREVENT Study.J Clin Psychiatry. 2007; 68: 1014-1023
Article info
Publication history
Published online: December 05, 2011
Accepted: November 9, 2011
Copyright
© 2011 Elsevier HS Journals, Inc. Published by Elsevier Inc. All rights reserved.