St. John's Wort and Major Depressive Disorder: A Closer Look

Paul Jones, Ed.D.
University of Nevada, Las Vegas

A study of the effectiveness of St. John's wort (Shelton et al., 2001), published April 18, 2001, in the Journal of the American Medical Association (JAMA), immediately generated headlines and brief stories in newspapers throughout the U.S. A typical headline was: "Study: St. John's Wort Ineffective". A typical lead sentence in the news stories was: "St. John's wort, the popular herbal remedy, is useless for alleviating severe depression, according to the first large study to evaluate it in the United States".

After reading the story in our local paper, I was both curious and puzzled. The finding did not appear consistent with the extensive research literature about the effectiveness of the herb, and the implications were not consistent with my own experience of satisfaction in using it. Critics of the study were, in fact, quick to respond, one noting that the study was limited to participants with major depression, another questioning the possible influence of the sponsor (Pfizer, a pharmaceutical company which sells the antidepressant Zoloft).

When questioned about the inconsistency with the prior research, the senior author of the study is reported to have characterized the prior studies as just "poorly designed". A representative of the American Psychiatric Association described the current study as "rigorous and sophisticated".

Always looking for good examples to use in my graduate research methods course and also to satisfy my own curiosity, I obtained a copy of the full report. I anticipated that a thorough review of the research report would provide a template that could be used with my students on how to appropriately design a rigorous study to answer an important question. What I found, instead, was a near perfect, textbook example of how easily research data can be distorted for an exaggerated, if not erroneous, interpretation.

Before justifying that last statement, I would note that I have neither need nor intent to denigrate the research team who conducted the study. They are obviously well qualified, and they appear to have "followed the rules" in designing the study. I also recognize that researchers cannot always be held accountable for the manner in which their findings are reported in the popular press. I do, though, believe that researchers, through what is emphasized and not emphasized in the summary of findings, can influence what is reported in the popular press. In this study, as will be demonstrated below, the outcomes that were not emphasized may well be more significant than those that were.

This review/critique will not focus on the limitation that this study included only participants with major depressive disorders. That fact was emphasized by the researchers, and the prefix "major" was typically included in the news reports. I would point out, however, that the typical newspaper reader is unlikely to be aware that there are a number of depression diagnoses, affecting the lives of countless individuals, which are sufficiently severe to impair daily function but which do not meet criteria for identification of "major depressive disorder".

Show Me The Data

The "shameless" adaptation of a popular phrase in this heading carries an important message for all who wish to make data-based decisions about the use of medications and/or nutritional supplements. Said most simply, "you'd better read the whole thing." In this specific instance, the data reported in the actual JAMA article are not what the summaries would lead one to believe.

To illustrate, the clear implication in the news reports (including the reported comments by the senior author) is that in an eight-week trial there was no difference in the outcome when participants who used St. John's wort were compared with those who received a placebo. That is incorrect. In actual fact, there was a difference, and the difference was in favor of the sample who received the herb. The authors even note in their summary that the St. John's wort sample produced a significantly greater proportion of symptom remission when compared to the placebo sample.

The authors, however, also report that their primary data analysis of the effectiveness of the herb was negative (thus, the headlines). That report, while "technically correct", can be quite misleading. There actually also was a difference in the primary data analysis, a difference which again was positive in favor of the sample who received St. John's wort.

To clarify, a brief digression into the "rules of research" is necessary. A researcher wants to investigate whether one condition (e.g. St. John's wort) results in an outcome different from another condition (e.g. placebo). A participant sample is obtained, and data are gathered.

The analysis of the data is most often in the form of testing a "null hypothesis". The null hypothesis predicts that there will be no difference in the outcome from the two conditions; the question then is whether the actual data allow the researcher to reject the null hypothesis.

The question above actually becomes two questions: 1) is there any difference in the actual outcomes from the two conditions in the sample data, and 2) if there is a difference, is the observed difference likely to have occurred by chance alone? In real life, the answer to the first question is typically "yes"; there is almost always some observed difference in the outcome. In this study, that difference was in favor of the St. John's wort sample.

The answer to the second question, however, is also crucial. The data for the analysis came, of course, from a sample, not from the whole population. To address the second question, the researcher in effect asks the following: "If in the total population there were really no difference in the outcome of the two conditions, what are the odds that I might get a difference at least as large as the one I see in the results from this sample?"

The gold standard for the second question was inherited from agricultural science in the last century. That standard is identified as a "p" value and the numerical standard is .05. If the "p" value is less than .05, the researcher rejects the null hypothesis and acknowledges that the two conditions produced different outcomes. If the "p" value is equal to or greater than .05, the researcher "fails to reject" the null hypothesis.

Putting this all together, a researcher almost always will see a difference in outcome associated with the research conditions. That difference can be identified as statistically significant only when the odds are less than 5 in 100 that a difference at least that large might have occurred in a sample if the "real" difference in the whole population were zero.

In this case, the better outcome observed in the St. John's wort group on the primary analysis was a difference which could have occurred by chance more than 5 times in 100 if there had really been no difference in the whole population. Thus, the researchers correctly reported that the difference between the St. John's wort group and the placebo group was not statistically significant.
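For readers who prefer to see the logic of the preceding paragraphs worked through, a minimal sketch follows. It uses a permutation test, which is one simple way to ask the "by chance alone" question; it is not the analysis used in the JAMA report. The counts (30 of 100 improved with the herb, 22 of 100 with placebo) are invented purely for illustration and are not data from the study, and the function name is my own.

```python
import random

def permutation_p_value(group_a, group_b, n_permutations=10_000, seed=0):
    """Two-sided permutation p-value for a difference in mean outcome.

    If the null hypothesis is true, group labels are arbitrary, so we
    reshuffle the labels many times and count how often the reshuffled
    difference is at least as large as the one actually observed.
    """
    rng = random.Random(seed)
    n_a, n_b = len(group_a), len(group_b)
    observed = abs(sum(group_a) / n_a - sum(group_b) / n_b)
    pooled = list(group_a) + list(group_b)
    at_least_as_extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        diff = abs(sum(pooled[:n_a]) / n_a - sum(pooled[n_a:]) / n_b)
        if diff >= observed - 1e-12:  # small tolerance for float rounding
            at_least_as_extreme += 1
    return at_least_as_extreme / n_permutations

# HYPOTHETICAL outcomes (1 = improved, 0 = not improved); not study data.
herb = [1] * 30 + [0] * 70
placebo = [1] * 22 + [0] * 78

observed_gap = sum(herb) / len(herb) - sum(placebo) / len(placebo)
p = permutation_p_value(herb, placebo)
print(f"observed difference = {observed_gap:.2f}, p = {p:.3f}")
print("reject the null" if p < 0.05 else "fail to reject the null")
```

With these made-up counts the herb group does better, yet the p value lands well above .05, so the formal verdict is "fail to reject" even though a difference was plainly observed.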

It is important to note, though, that while the "less than 5 times in a hundred" is the traditional standard for research, it is not without controversy. More than a few researchers believe that the standard is at best overly conservative and at worst a totally inappropriate way to evaluate important data. Cohen (1994), for example, rejects the entire concept of null hypothesis testing. Rosnow and Rosenthal (1989) question the arbitrary .05 line, stating "... surely, God loves the .06 nearly as much as the .05."

The clear implication in the reports of this research study is that their failure to reject the null hypothesis in the primary analysis translates to a recommendation not to use St. John's wort for symptoms of major depression. Parkhurst (1985), however, cautioned against such interpretation, noting: "Failing to reject a null hypothesis is distinctly different from proving a null hypothesis; the difference in these interpretations is not merely a semantic point. Rather, the two interpretations can lead to quite different biological conclusions". A later work (Parkhurst, 1990) stated directly that failure to reject a null hypothesis should not be used as justification for taking actions that would be appropriate if the null hypothesis were proven true.
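Parkhurst's caution can be made concrete with a small simulation. The sketch below assumes a world in which the treatment really does work (a true improvement rate of 30% versus 22%, with 100 participants per group; all of these numbers are hypothetical and are not taken from the study) and counts how often a standard two-proportion z-test would nevertheless fail to reach the .05 level. The function names are illustrative.

```python
import math
import random

def two_proportion_p_value(successes_a, n_a, successes_b, n_b):
    """Two-sided p-value from the normal-approximation z-test for two proportions."""
    pooled = (successes_a + successes_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0  # no variation at all, so no evidence of a difference
    z = (successes_a / n_a - successes_b / n_b) / se
    return math.erfc(abs(z) / math.sqrt(2))

def estimated_power(rate_a, rate_b, n_per_group, alpha=0.05, n_trials=2000, seed=1):
    """Fraction of simulated trials in which a REAL difference reaches p < alpha."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_trials):
        a = sum(rng.random() < rate_a for _ in range(n_per_group))
        b = sum(rng.random() < rate_b for _ in range(n_per_group))
        if two_proportion_p_value(a, n_per_group, b, n_per_group) < alpha:
            rejections += 1
    return rejections / n_trials

# HYPOTHETICAL true rates and sample size; not values from the study.
power = estimated_power(0.30, 0.22, n_per_group=100)
print(f"share of simulated trials reaching p < .05: {power:.2f}")
print(f"share that 'fail to reject' despite a real effect: {1 - power:.2f}")
```

Under these assumptions, most simulated trials fail to reject the null hypothesis even though the effect is real by construction, which is exactly why a non-significant result cannot be read as proof of no effect.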

In this particular case, the "p" value for the comparison between the St. John's wort group and the placebo group was equal to .16. Said another way, the primary analysis showed that the St. John's wort group had more positive outcomes, and a difference at least this large would have occurred only approximately 16 times out of 100 if the population difference were zero. Many of us (certainly those who live and/or play in Las Vegas) would suggest that a "safe" bet can be placed with far less overwhelming odds.

What Exactly Is The Evidence?

So far, we have identified from the actual research report that the analysis of symptom remission found a significant difference in favor of the St. John's wort group. We have further identified that the technique chosen by the researchers as their primary data analysis tool also found a difference in favor of that group, a difference which many would define as unlikely to have occurred by chance alone.

And, there is more. Graduate students are taught that answering the question in the heading above requires a careful look at what else might have caused/contributed to the results. The basic idea is that the comparison groups should be "different" from each other only in the experimental condition (e.g. herb vs. placebo). The technique used to accomplish this equality is that persons will be randomly assigned to receive one or the other of the two conditions. The assumption, particularly in relatively large samples, is that this will result in the two groups being comparable on other factors that might affect the results. Thus, any difference in the results would be caused by the experimental condition.

The researchers in this study appear to have taken the appropriate steps for random assignment. But randomization does not guarantee equality; it only increases the chances that the groups will be equal on other relevant features. In this study, did randomization succeed in equalizing the relevant features? Two factors reported in the study suggest that the answer might be "no": the groups may in fact not have been equal on other factors with impact on the results.

All participants in the study met diagnostic criteria for identification with major depressive disorder. There are, however, two primary subcategories of this disorder, single-episode and recurrent. The difference between the two is as implied in their names. The individual has had only one identified episode of major depression, or the individual has a history of recurrence of episodes of depression. Many, perhaps most, clinicians would likely note that the former are far easier to treat than the latter. In this study, the distribution was similar but not equal. Approximately 38% (37.8) of the placebo sample had single-episode major depression diagnoses. Approximately 32% (31.6) of the St. John's wort group were diagnosed with single-episode disorder.

Participants who were concurrently receiving psychotherapy were allowed to participate in the study if the frequency of therapy sessions did not change during the period of the study. Although very few participants were concurrently participating in psychotherapy, they also were not equally balanced between the two groups: eight of the participants in the placebo group had concurrent psychotherapy during the study; only four of the participants in the St. John's wort group were also receiving psychotherapy.

Could these differences have influenced the results? The report does not specify whether these specific features were included in the analysis of comparability, and the differences between the groups were not large. But it is certainly not unreasonable to suspect that persons with recurrent depressive disorder could have had a somewhat lower positive response as compared to persons with single-episode disorder. And one would hope that persons receiving concurrent psychotherapy might have at least a slightly more favorable outcome than those who were not. In either case, any effect in this study would have been more likely to favor the placebo group. The actual difference, if any, would probably not have been large. The gap between a .16 and a .05 significance level, however, is also not large. Consider how different the headlines would have been had the magic .05 level been reached.
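The earlier point that randomization does not guarantee equality can also be checked by simulation. The sketch below assumes a hypothetical sample of 200 participants, 35% of whom carry some binary feature (for example, a single-episode diagnosis), randomly split into two groups of 100. These figures are chosen for illustration and are not the study's actual sample sizes. It counts how often the two groups end up at least six percentage points apart on that feature.

```python
import random

def imbalance_frequency(n_total, feature_rate, gap_threshold,
                        n_simulations=5000, seed=2):
    """Share of random splits in which the two groups differ on a binary
    feature by at least gap_threshold (expressed as a proportion)."""
    rng = random.Random(seed)
    half = n_total // 2
    n_with_feature = round(n_total * feature_rate)
    # A fixed participant pool: True = has the feature, False = does not.
    participants = [True] * n_with_feature + [False] * (n_total - n_with_feature)
    imbalanced = 0
    for _ in range(n_simulations):
        rng.shuffle(participants)  # random assignment: first half vs. second half
        rate_a = sum(participants[:half]) / half
        rate_b = sum(participants[half:]) / (n_total - half)
        if abs(rate_a - rate_b) >= gap_threshold - 1e-9:  # float tolerance
            imbalanced += 1
    return imbalanced / n_simulations

# HYPOTHETICAL sample size and feature rate; not values from the study.
freq = imbalance_frequency(n_total=200, feature_rate=0.35, gap_threshold=0.06)
print(f"share of random splits with a gap of 6+ points: {freq:.2f}")
```

Under these assumptions a gap of that size turns up in a sizable share of perfectly proper random assignments, so an imbalance of this kind implies no flaw in procedure; it simply means the imbalance has to be considered when interpreting the outcome.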

Eye of the Beholder

As noted at the beginning of this paper, its intent was not to demean the researchers. Large-scale clinical trials are difficult to accomplish, and the researchers obviously went to extraordinary efforts in an attempt to obtain a valid design. That having been said, it also seems clear that the reported interpretation of the outcome of the study goes well beyond what the actual data support. In fact, the exact same data set could have been used to support a headline, "St. John's Wort Shown To Be Effective Even With Major Depressive Disorders". The lead sentence could have been: "Despite conditions which may have favored the placebo group, significant differences in remission rate were found in the group that received the herbal treatment, and the primary analysis tool also showed a difference in favor of this group."

Whether a less conservative standard for identifying statistical significance should be employed is a topic which will generate controversy among (and probably be of interest only to) statisticians for the foreseeable future. One generally agreed position, though, is that whether a less rigid standard than the .05 level should be accepted is contingent on the risk of being wrong.

So, finally, what would have been the risk, in this case, of incorrectly rejecting the null hypothesis? The authors conclude that persons with significant major depression should not be treated with St. John's wort. Their own data, however, showed that during an eight-week period, there was a statistically significant positive outcome for the participant sample, all of whom had major depressive disorder, approximately half of whom were taking St. John's wort. They reported that the approximately half who took St. John's wort had a better outcome with a difference that would be expected only about 16 times in 100 by chance alone. They further reported that St. John's wort was safe and well-tolerated by the participant sample.

One is left to ponder how the researchers arrived at their conclusion. Their rationale for making such a definitive statement that does not appear supported by their own data is unclear. What is clear, though, is that even results of studies which are praised for their "rigor and sophistication" cannot be taken at face value alone. Both for this and any future studies, "show me the actual data" would be a prudent demand.

References

Cohen, J. (1994). The earth is round (p<.05). American Psychologist 49: 997-1003.

Parkhurst, D.F. (1985). Interpreting failure to reject a null hypothesis. Bulletin of the Ecological Society of America 66: 301-302.

Parkhurst, D.F. (1990). Statistical hypothesis tests and statistical power in pure and applied science. Acting Under Uncertainty: Multidisciplinary Conceptions, G.M. von Furstenberg, ed., Kluwer Academic Publishers, Boston. pp. 181-201.

Rosnow, R.L. and Rosenthal, R. (1989). Statistical procedures and the justification of knowledge in psychological science. American Psychologist 44: 1276-1284.

Shelton, R.C., et al. (2001). Effectiveness of St. John's wort in major depression: A randomized controlled trial. Journal of the American Medical Association 285(15): 1978-1986.

Study: St. John's Wort ineffective. Las Vegas Sun, April 17, 2001. Retrieved April 21, 2001, from the World Wide Web: http://www.lasvegassun.com/sunbin/stories/archives/2001/apr/17/041705639.html

Vedantam, S. (2001). St. John's Wort ineffective, large study finds. Washington Post OnLine, April 18, 2001. Retrieved April 21, 2001, from the World Wide Web: http://washingtonpost.com/wp-dyn/health/A29599-2001Apr17.html

_____________________________________________________________________________

Dr. Paul Jones is Professor and Senior Research Scientist in the Department of Educational Psychology, University of Nevada, Las Vegas. He is a licensed psychologist (Nevada and New Mexico) with more than sixty publications in professional journals, including articles on statistical analysis, and recent books on the use of the Diagnostic and Statistical Manual of Mental Disorders and on evaluating research studies.

