Virtual Mentor. January 2011, Volume 13, Number 1: 46-51.
Rating Evidence in Medical Literature
Not all medical evidence is equally reliable. An overview of the grading systems used for sorting evidence on the basis of study design and quality by the Oxford Centre for Evidence-Based Medicine, the U.S. Preventive Services Task Force, and the journal of the American Academy of Family Physicians.
Opeyemi O. Daramola, MD, and John S. Rhee, MD, MPH
A 24-year-old medical student comes to your clinic having had purulent rhinorrhea for 14 days, preceded by symptoms of upper respiratory infection. She reports having facial pain, frontal headache, nasal congestion, fever, and overall malaise. Nasal endoscopy reveals inflamed nasal mucosa with significant edema bilaterally. There is purulent rhinorrhea in the left middle meatus. Both cheeks are tender to the touch. You prescribe a 10-day course of amoxicillin and daily use of an intranasal steroid spray. She agrees with the use of amoxicillin but questions your nasal steroid recommendation. She proceeds to ask you about the effectiveness of intranasal steroids as adjunctive therapy and the strength of reported evidence supporting this recommendation.
An eager learner observing a seasoned physician will often probe the origin of the physician’s recommendation. Today’s patients are encouraged to seek more education about their health. Thus, they are not shy about questioning their physician’s recommendations. If the efficacy of an intervention has been established, how does it compare to available alternatives? How does one reach conclusions about the strength of relevant comparisons?
A concise and widely cited definition of evidence-based medicine (EBM) was formulated by David Sackett, one of its pioneers. Sackett and colleagues define EBM as the “conscientious, explicit, and judicious use of current best evidence in making decisions about the care of individual patients.” In practice, the provision of compassionate EBM reflects the integration of evidence from research, wisdom from clinical experience, and respect for the patient’s values and preferences, while recognizing existing circumstances [2, 3]. Most journals and specialty academies are dedicated to the continuous pursuit of high-quality studies and explicit grading recommendations in order to provide effective guidelines to physicians.
To understand the strength of guidelines and management strategies, one must be familiar with the different levels of evidence. The Oxford Centre for Evidence-Based Medicine (OCEBM) provides a popular scale for stratifying evidence from strongest to weakest on the basis of susceptibility to bias and the quality of the study design. A modified and condensed version of the OCEBM scale is presented in table 1. A similar hierarchy is used by the U.S. Preventive Services Task Force in grading evidence [6, 7].
Table 1. Modified presentation of the Oxford Centre for Evidence-Based Medicine levels of evidence.
SR: systematic review; RCT: randomized controlled trial.
Randomized controlled trials (RCTs) are considered the gold standard in modern medicine for determining the efficacy of a treatment. Individual RCTs are level 1b evidence. Systematic reviews of homogeneous RCTs are regarded as the highest level of evidence, level 1a. These systematic reviews consist of information synthesized from individual, well-designed RCTs in which participants are similar and have equal chances of being assigned to an intervention group, a control group, or a placebo group. Systematic reviews of trials with blinded investigators and subjects (i.e., double-blinded RCTs) are even more desirable than reviews of non-double-blinded trials. These studies use rigorous measures to minimize bias, but they tend to be expensive and time-consuming.
In the case of our medical student, a literature search would reveal a published Cochrane Database systematic review of double-blinded RCTs. This review reported that intranasal corticosteroids (INCS) had been found to be effective as monotherapy or as adjunctive treatment when compared to placebo treatment for acute rhinosinusitis. The review examined 475 studies but excluded 471. In the four selected studies, which had a robust total of 1,943 participants, those treated with INCS had earlier resolution or improvement of symptoms than those receiving a placebo. This systematic review selected high-quality, double-blinded, placebo-controlled RCTs with homogeneous design, clear reporting of outcomes, and an adequate number of subjects to establish clinical significance.
Cohort studies are considered level 2b evidence. In this design, a population (cohort) is defined according to the presence or absence of a variable that may potentially influence the occurrence of a specific disease. Cohort studies can be prospective or retrospective. In prospective cohort studies, people at risk for certain diseases are followed over time to investigate trends or risk factors in those who get the disease. Predictor variables are measured before outcomes occur. In retrospective cohort studies, the sample is defined and predictor variables are reported after the outcomes have occurred. Epidemiology studies that compare outcomes of people who had a certain exposure to unexposed subjects are examples of cohort studies.
Suppose you are counseling a 35-year-old woman whose husband is addicted to smoking tobacco about the effect of environmental tobacco smoke (ETS) on cardiovascular health. Because the deleterious effects of smoking tobacco are well-established, it would be unethical to perform an RCT to answer this question. An appropriate cohort study, such as one performed by Iribarren et al., would be the highest level of study that can be performed ethically and pragmatically to address the question in this scenario. Iribarren et al. investigated the independent effect of exposure to ETS on the risk of stroke among 27,698 lifelong nonsmokers. They found that 20 hours or more a week of ETS exposure at home (compared to less than 1 hour a week) was associated with a 1.29-fold and a 1.50-fold increased risk of first ischemic stroke among men and women, respectively.
In matched case-control studies (level 3b evidence), investigators retrospectively evaluate two groups, one with the disease and one without, with the intent of finding risk factors or trends. Subjects are matched for age, sex, and other demographics. For example, in a Swedish nationwide study, Lagergren et al. convincingly demonstrated that people who had weekly symptoms of esophageal reflux disease were eight times more likely to have adenocarcinoma of the esophagus than matched subjects without these symptoms. In other words, these investigators looked for the prevalence of reflux (predictor variable) among subjects with confirmed esophageal adenocarcinoma (cases) and compared it to the prevalence of reflux symptoms in a sample of those who did not have adenocarcinoma of the esophagus (controls).
A case report that provides information on the diagnosis, intervention, and outcome for a single individual is level 4 evidence. Case series, articles written about a series of patients with a specific diagnosis, are also regarded as level 4 evidence. Both case reports and case series describe characteristics of patients with certain diseases and may help identify questions for future research. These studies are ranked lower than other designs because of associated bias, lack of random sampling, the absence of controls or a comparison group, and heterogeneity of subjects. While these studies do not meet the criteria necessary for achieving higher evidence level status, they are quite common in reporting outcomes in surgical specialties. Some diseases treated by surgical intervention (or nonintervention) do not lend themselves well to the higher level study designs previously mentioned. For example, performing sham surgeries for the sake of a controlled trial is ethically unacceptable. Systematic reviews of case series and case reports are helpful in identifying trends that lead to positive outcomes in diseases with high morbidity or that are treated surgically.
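The comparison of exposure between cases and controls is commonly summarized as an odds ratio. The short Python sketch below illustrates that calculation for a simple unmatched 2x2 table. The function name and the counts are hypothetical and chosen only for illustration; they are not taken from Lagergren et al., and a matched design would normally call for a matched-pair analysis instead.

```python
def odds_ratio(exposed_cases, unexposed_cases, exposed_controls, unexposed_controls):
    """Odds of exposure among cases divided by odds of exposure among controls."""
    counts = (exposed_cases, unexposed_cases, exposed_controls, unexposed_controls)
    if min(counts) <= 0:
        raise ValueError("all four cell counts must be positive")
    return (exposed_cases * unexposed_controls) / (unexposed_cases * exposed_controls)


# Hypothetical counts, for illustration only (not data from any cited study):
# 60 of 100 cases report the exposure, versus 20 of 100 controls.
print(odds_ratio(60, 40, 20, 80))  # (60 * 80) / (40 * 20) = 6.0
```

An odds ratio above 1 means the exposure is more common among cases than among controls; it does not by itself establish causation.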
Grading Evidence in Medical Literature
Different specialty academies and journals have historically adopted unique systems to grade medical evidence and indicate the strength of disease-specific treatment guidelines [4, 6]. Grading systems arm physicians with information to help them make consistent, well-informed decisions and limit disparities in health care. Each system has its own shortcomings. A detailed explanation of the disadvantages of each system is beyond the scope of this article. (The reader is referred to a review paper by David Atkins et al., which appraised six prominent systems for grading levels of evidence.)
In 2002, the Agency for Healthcare Research and Quality (AHRQ) conducted a review of available methodologies for grading the strength of a body of scientific evidence. This review identified three important characteristics to consider in assigning a grade to studies: quality, quantity, and consistency. Quality, as discussed above, refers to the methodologic rigor or extent to which bias was minimized in a study. Consistency refers to the similarities in design, population, outcome, and data analysis in studies attempting to answer the same question. Quantity refers to the number of subjects in individual studies and number of studies included in reviews. Seven systems fully addressed these key elements.
Grading of recommendations is useful when there is a need for a consensus guideline regarding the approach to a particular disease. Systematic reviews report the levels of evidence present in given studies and then assign grades to recommendations from these studies that reflect the strength of the intervention and likelihood of a successful outcome. The OCEBM system has grades of recommendations. Under this scheme, a grade A is a strong recommendation for or against an intervention. After critical appraisal, well-designed level 1a to 1c studies tend to result in grade A recommendations, level 2a to 3b studies result in grade B recommendations, and recommendations derived from level 4 studies are typically labelled grade C. Level 5 studies or “troubling,” “imprecise” studies at any level above 5 generate grade D recommendations (table 2). For example, recommendations from expert opinion without objective critical appraisal tend to be regarded as inconclusive and cannot be given a grade stronger than D.
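The typical correspondence between evidence level and recommendation grade described above can be sketched as a simple lookup in Python. This is an illustration only, not the OCEBM algorithm: the names `TYPICAL_GRADE` and `typical_grade` are hypothetical, the intermediate levels (1c, 2a, 2c, 3a) are included on the assumption that the ranges "1a to 1c" and "2a to 3b" follow the usual OCEBM ordering, and real grading depends on critical appraisal rather than on level alone.

```python
# Typical grade for each evidence level, following the ranges given in the text.
# Assumption: intermediate levels follow the usual OCEBM ordering.
TYPICAL_GRADE = {
    "1a": "A", "1b": "A", "1c": "A",
    "2a": "B", "2b": "B", "2c": "B", "3a": "B", "3b": "B",
    "4": "C",
    "5": "D",
}


def typical_grade(level, troubling_or_imprecise=False):
    """Return the grade a study at this level would typically support."""
    if level not in TYPICAL_GRADE:
        raise ValueError(f"unknown evidence level: {level!r}")
    # Troubling or imprecise studies generate grade D at any level.
    if troubling_or_imprecise:
        return "D"
    return TYPICAL_GRADE[level]


print(typical_grade("1a"))  # A: systematic review of homogeneous RCTs
print(typical_grade("2b"))  # B: cohort study
print(typical_grade("4"))   # C: case report or case series
print(typical_grade("2b", troubling_or_imprecise=True))  # D
```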
Another popular grading system is the Strength of Recommendation Taxonomy (SORT) used by the journal of the American Academy of Family Physicians. While the algorithms behind these systems are not identical, the outcomes are fundamentally similar. The simplified version in table 2 underrepresents the complexity of these systems, and the reader is encouraged to peruse the algorithms behind them [4-6, 11].
Table 2. Similarities between the SORT and OCEBM grading systems.
*SORT: Strength of Recommendation Taxonomy