Contents
- Glossary of Key Terms
- Key Takeaways
- What Is a Systematic Review?
- Assembling the Review Team
- Formulating Your Research Question
- Protocol Development and Registration
- Conducting the Literature Search
- Study Selection
- Data Extraction
- Critical Appraisal and Quality Assessment
- Synthesizing Results
- Reporting the Systematic Review
- Dissemination and Publication
- Software, Tools, and Emerging Technologies
- Common Pitfalls and How to Avoid Them
- When and How to Update a Systematic Review
- Frequently Asked Questions
Glossary of Key Terms
| Term | Definition |
| Systematic Review | A structured synthesis of all available evidence on a defined research question, using pre-specified, reproducible methods to identify, select, appraise, and summarize studies. |
| Meta-Analysis | A statistical technique used within or alongside a systematic review to pool quantitative data from multiple studies into a single effect estimate. |
| PICO/PICOS | A framework for structuring a clinical research question: Population, Intervention, Comparison, Outcome, and (optionally) Study design. |
| PRISMA | Preferred Reporting Items for Systematic Reviews and Meta-Analyses. The leading reporting guideline for systematic reviews, comprising a checklist and flow diagram. |
| PROSPERO | International Prospective Register of Systematic Reviews. A publicly searchable registry for systematic review protocols, hosted by the University of York. |
| Protocol | A pre-specified plan documenting the rationale, objectives, methods, and analysis approach of a systematic review, ideally registered before the review begins. |
| Inclusion/Exclusion Criteria | Pre-defined rules that determine which studies are eligible for the review and which are not, based on features such as population, design, outcome, and date. |
| Grey Literature | Research produced outside of traditional academic publishing channels, including government reports, conference abstracts, dissertations, and unpublished studies. |
| Critical Appraisal | A systematic evaluation of a study’s validity, results, and relevance, using structured tools to assess risk of bias and methodological quality. |
| Risk of Bias | The degree to which flaws in study design, conduct, or reporting may distort the results away from the true effect. |
| Data Extraction | The process of systematically retrieving pre-specified information from each included study into a standardized form. |
| Narrative Synthesis | A qualitative approach to combining study findings using text and tables when statistical pooling is not appropriate. |
| Heterogeneity | Variability among studies in their design, populations, interventions, or results, which affects whether meta-analysis is appropriate. |
| Inter-Rater Reliability | A measure of agreement between two or more independent reviewers at screening or data extraction stages, commonly expressed using Cohen’s Kappa. |
| Publication Bias | The tendency for studies with positive or statistically significant results to be published more readily than studies with null or negative findings. |
| Scoping Review | A type of evidence synthesis that maps the available evidence on a broad topic, identifying key concepts and gaps without formal quality appraisal. |
| Umbrella Review | A review of reviews that synthesizes multiple systematic reviews on a related topic to provide a higher-level evidence overview. |
Key Takeaways
- A systematic review is the gold standard for synthesizing evidence: it uses pre-specified, reproducible methods to minimize bias and provide a reliable summary of a research question.
- Register before you search: submitting a protocol to PROSPERO (or a comparable registry) before data collection begins increases transparency and reduces reporting bias.
- Team composition matters: at least two independent reviewers are required for screening and data extraction; a specialist librarian should be involved in search design.
- Use PICO(S) to define your question: a well-formed question anchors every downstream decision, from database selection to inclusion criteria.
- Search comprehensively: multiple databases, grey literature, citation chaining, and trial registries are needed to approach completeness.
- Document every decision: a PRISMA flow diagram and a clearly reasoned exclusion log allow readers to replicate and audit the review.
- Critical appraisal is non-negotiable: quality assessment does not exclude studies automatically but informs how their findings are weighted in synthesis.
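- Choose the right synthesis method: meta-analysis is appropriate only when studies are sufficiently similar; otherwise, a structured narrative synthesis is the more honest alternative. Vote counting, if used at all, should be based on direction of effect rather than statistical significance.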
- Report transparently using PRISMA: the PRISMA checklist and flow diagram are the minimum standard; extensions exist for specific review types.
- A systematic review takes time: from protocol development to publication, the process typically spans six months to two years.
What Is a Systematic Review?
A systematic review is a rigorous, pre-planned synthesis of all available evidence on a specific research question. Unlike a narrative or traditional literature review, every stage of a systematic review, from the search strategy to the inclusion decisions and quality assessments, is conducted using explicit, reproducible methods. This transparency is what distinguishes systematic reviews from other forms of evidence summary and elevates them to the top of the evidence hierarchy.
Systematic reviews serve several important purposes:
- Summarizing what is known about a clinical, policy, or scientific question
- Identifying gaps in the existing evidence base that warrant future research
- Providing a foundation for clinical guidelines, public health recommendations, and policy decisions
- Resolving apparent conflicts between individual studies
- Reducing duplication of research effort by mapping existing work
How Does a Systematic Review Differ from Other Review Types?
A systematic review uses a registered protocol, exhaustive searching, and formal quality appraisal, whereas other review types may not. The table below compares the most common review formats.
| Review Type | Question Focus | Quality Appraisal? | Typical Use Case |
| Systematic Review | Specific, narrow | Yes, formal | Guideline development, evidence synthesis |
| Meta-Analysis | Specific, narrow | Yes | Quantitative pooling of effect sizes |
| Scoping Review | Broad, exploratory | Usually not | Mapping a new field, identifying gaps |
| Rapid Review | Specific | Abbreviated | Time-sensitive policy decisions |
| Umbrella Review | Broad | Yes, of reviews | Synthesis of multiple systematic reviews |
| Narrative Review | Variable | No | Background sections, educational overviews |
Is a Systematic Review the Right Choice for Your Project?
Yes, when you need to inform policy or practice with the strongest possible evidence. However, a systematic review is not always appropriate. Consider a systematic review when:
- A clearly defined, answerable question exists
- Multiple primary studies on the topic have likely been published
- An evidence synthesis could meaningfully inform a decision
- Sufficient time and team resources are available (typically six months to two years)
A scoping review, rapid review, or narrative review may be more appropriate when the field is emergent, the question is broad, or time constraints are severe.
Assembling the Review Team
Systematic reviews require multi-disciplinary expertise. No single person can reliably conduct a rigorous systematic review alone: independent duplication at key stages is both a methodological safeguard and a requirement of most reporting standards.
Recommended team composition:
| Role | Core Responsibilities | Essential? |
| Principal Investigator / Lead Reviewer | Oversees the review, leads protocol development, makes final decisions on disagreements | Yes |
| Second Reviewer | Independently screens titles, abstracts, and full texts; independently extracts data; resolves inter-rater disagreements | Yes (minimum two reviewers required) |
| Subject Librarian or Information Specialist | Designs and executes the search strategy, selects databases, handles grey literature, peer-reviews the search (PRESS) | Strongly recommended |
| Statistician or Methodologist | Advises on meta-analysis methods, heterogeneity analysis, and sensitivity analyses | Required if meta-analysis planned |
| Content Expert(s) | Provides domain knowledge during protocol development, inclusion decisions, and interpretation | Recommended |
All team members should agree on roles, timelines, and authorship criteria at the outset. Where there is genuine disagreement between reviewers that cannot be resolved through discussion, a third reviewer or arbitrator should be pre-specified in the protocol.
Formulating Your Research Question
A precise, answerable research question is the foundation of every subsequent decision in the review process, from which databases to search to which outcomes to measure. Vague questions lead to irreproducible searches and unmanageable result sets.
What is the PICO(S) Framework?
PICO(S) is the most widely used framework for structuring questions in health and social science reviews:
| Letter | Element | Guiding Question |
| P | Population | Who are the participants? (e.g., adults with Type 2 diabetes, children aged 5-12) |
| I | Intervention | What is the exposure or intervention of interest? (e.g., metformin, cognitive behavioral therapy) |
| C | Comparison | What is the comparator? (e.g., placebo, usual care, alternative treatment) |
| O | Outcome | What are the outcomes of interest? (e.g., HbA1c reduction, quality of life, mortality) |
| S | Study Design | Which study designs will be included? (e.g., randomized controlled trials only, all observational designs) |
Example
A well-formed PICO question might read: “In adults with chronic low back pain (P), does mindfulness-based stress reduction (I) compared with physiotherapy (C) reduce pain intensity and disability at 12 months (O) in randomized controlled trials (S)?”
Alternatives to PICOS
Alternative frameworks include PICo (for qualitative research, replacing Intervention with phenomenon of Interest and adding Context), SPICE (Setting, Perspective, Intervention, Comparison, Evaluation), SPIDER (Sample, Phenomenon of Interest, Design, Evaluation, Research type), ECLIPSE (for health management and service questions), PIRD (for diagnostic test accuracy), and PICOTS (PICO plus Timeframe and Setting). The choice of framework should match the nature of the research question and the expected evidence base.
Qualitative & phenomenological: PICo
| Letter | Element | Meaning / example |
| P | Population | Who: e.g. adults living with HIV |
| I | phenomenon of Interest | What experience or concept: e.g. stigma experiences |
| Co | Context | Where/when: e.g. rural sub-Saharan Africa |
Best for: qualitative reviews exploring lived experience, attitudes, or meaning.
Health services & policy: SPICE
| Letter | Element | Meaning / example |
| S | Setting | Context of delivery: e.g. primary care clinics |
| P | Perspective | Whose viewpoint: e.g. patients, clinicians |
| I | Intervention | What is being evaluated: e.g. telehealth consultations |
| C | Comparison | Versus what: e.g. in-person visits |
| E | Evaluation | How success is measured: e.g. patient satisfaction scores |
Best for: service delivery, policy, or implementation questions where setting and stakeholder perspective are central.
Qualitative & mixed methods: SPIDER
| Letter | Element | Meaning / example |
| S | Sample | Who: e.g. postpartum women |
| PI | Phenomenon of Interest | What experience: e.g. breastfeeding decisions |
| D | Design | Study design: e.g. interviews, focus groups |
| E | Evaluation | Outcome or construct: e.g. self-efficacy |
| R | Research type | Qualitative, quantitative, or mixed |
Best for: qualitative or mixed-methods reviews; tends to be more specific than PICOS for qualitative searches, though it can be less sensitive.
Health management & service improvement: ECLIPSE
| Letter | Element | Meaning / example |
| E | Expectation | What improvement is sought: e.g. reduce waiting times |
| C | Client group | Who benefits: e.g. elderly patients |
| L | Location | Where: e.g. NHS outpatient departments |
| I | Impact | Desired change: e.g. improved throughput |
| P | Professionals | Who delivers it: e.g. triage nurses |
| SE | Service | Type of service: e.g. emergency care |
Best for: health management, service improvement, and organizational research questions.
Diagnosis & test accuracy: PIRD
| Letter | Element | Meaning / example |
| P | Population | Who: e.g. adults with suspected PE |
| I | Index test | Test under evaluation: e.g. D-dimer assay |
| R | Reference standard | Gold standard: e.g. CT pulmonary angiography |
| D | Diagnosis | Target condition: e.g. pulmonary embolism |
Best for: diagnostic accuracy reviews; aligns with the QUADAS-2 appraisal tool.
Prognosis & prediction: PICOTS
| Letter | Element | Meaning / example |
| P | Population | Who: e.g. post-MI patients |
| I | Intervention | Prognostic factor or treatment: e.g. statin therapy |
| C | Comparison | Comparator: e.g. no statin |
| O | Outcome | Event of interest: e.g. 5-year mortality |
| T | Timeframe | Follow-up period: e.g. 5 years post-discharge |
| S | Setting | Care context: e.g. community cardiology |
Best for: prognosis reviews and intervention reviews where timing and setting are critical moderators.
Summary of PICOS Alternatives
- PICo strips out the comparison element entirely, making it suitable for qualitative syntheses where there is no “intervention vs. control” logic.
- SPICE foregrounds the stakeholder perspective and the setting, which makes it popular in health services and policy research.
- SPIDER adds study design as an explicit element and uses “Sample” instead of “Population”; in qualitative literature this tends to produce more specific searches (fewer irrelevant records), though potentially at some cost to sensitivity.
- ECLIPSE is organized around organizational expectations and service delivery rather than clinical intervention, useful for management and improvement research.
- PIRD is purpose-built for diagnostic test accuracy reviews and maps directly onto the QUADAS-2 appraisal tool.
- PICOTS extends PICOS with a timeframe and a setting element, which is particularly useful when prognosis or follow-up duration is a core part of the question.
The practical rule: if your question involves an experience or meaning, reach for PICo or SPIDER; if it involves a diagnostic test, use PIRD; if it involves a service or organization, try SPICE or ECLIPSE; and if timing is central, add the T and S of PICOTS.
Checking for Existing Reviews
Before finalizing your question, search PROSPERO, the Cochrane Database of Systematic Reviews, and databases such as PubMed to determine whether a recent, high-quality systematic review on the same question already exists. If one does:
- Consider whether an update is justified (new evidence, narrower scope, different population)
- Clearly articulate how your review differs from existing work
- Avoid contributing to research waste by duplicating recent high-quality reviews
Protocol Development and Registration
A protocol is a detailed, pre-specified plan for your systematic review. Writing and registering the protocol before searching begins is one of the most important steps in ensuring the integrity and transparency of the review.
Why Register a Protocol?
Registration reduces bias, increases transparency, and signals to the research community that your review is underway. A pre-registered protocol:
- Prevents post-hoc changes to outcomes or methods that could inflate apparent significance
- Allows readers and editors to identify deviations from the planned methodology
- Reduces duplication by alerting others to ongoing reviews
- Is increasingly required by journals as a condition of publication
Where and When to Register
PROSPERO (hosted by the Centre for Reviews and Dissemination, University of York) is the leading international registry for health-related systematic reviews. Registration must be completed before data extraction begins; ideally it should occur before screening starts. Other registries include OSF Registries (for social science and psychology), the Cochrane editorial management system (for Cochrane protocols), and field-specific registries.
What Should a Protocol Include?
A rigorous protocol should address the following elements:
- Background and rationale for the review
- Research question stated using PICO or equivalent framework
- Eligibility criteria for inclusion and exclusion of studies
- Databases to be searched and planned search strategy
- Grey literature sources to be consulted
- Process for study selection (number of reviewers, software to be used, handling of disagreements)
- Data extraction template and planned data items
- Risk of bias assessment tool and process
- Synthesis approach (narrative synthesis or meta-analysis) and any planned subgroup or sensitivity analyses
- Any planned assessment of publication bias
Any deviations from the registered protocol that occur during the review process should be transparently reported and justified in the final manuscript.
Sample PROSPERO registered protocol
A sample PROSPERO-registered protocol for a systematic review in endocrinology is available for download.
The template covers all nine major sections of a PROSPERO-compliant protocol:
- Administrative details: title, registration, versioning, authors, funding, and conflicts of interest
- Background and rationale: the clinical problem, evidence gap, and stated objectives
- Research question (PICOS): each element populated for the GLP-1 RA vs. insulin in T2DM + CKD question
- Eligibility criteria: explicit inclusion and exclusion rules, with language and date policy
- Search strategy: databases, grey literature sources, and a sample MEDLINE string with MeSH and free-text terms
- Study selection: the two-stage screening process, software, disagreement handling, and inter-rater reliability
- Data extraction: form development, piloting, and a structured list of data items by category
- Risk of bias: RoB 2.0 for RCTs, ROBINS-I for observational studies, with traffic-light plot reporting
- Synthesis and reporting: narrative synthesis, meta-analysis conditions, heterogeneity thresholds, pre-specified subgroup and sensitivity analyses, publication bias assessment, GRADE, and dissemination plan
| Section | PROSPERO field | Key content (this protocol) | Notes / guidance |
| 1. Admin | Title, registration, authors, amendments, funding, COI | GLP-1 RA vs insulin in T2DM + CKD; PROSPERO pre-registration; multi-disciplinary team including librarian and statistician | Register before screening begins; log all amendments with date and rationale; declare all funding and conflicts |
| 2. Intro | Rationale and objectives | 537 million T2DM worldwide; CKD complicates glycemic management; no existing SR directly comparing GLP-1 RA vs insulin in CKD 3–5; primary objective: HbA1c reduction at 6 and 12 months | Preliminary search of PROSPERO and Cochrane must precede this section; state the gap explicitly; list primary and secondary objectives separately |
| 3a. PICOS | Eligibility criteria | P: Adults, T2DM, CKD stage 3a–5 non-dialysis (eGFR less than 60 mL/min/1.73 m²). I: Any GLP-1 RA (semaglutide, liraglutide, dulaglutide, exenatide), plus the dual GIP/GLP-1 receptor agonist tirzepatide. C: Any insulin regimen (basal, basal-bolus, premixed). O: HbA1c; weight; eGFR; UACR; hypoglycemia; SAEs; mortality. S: RCTs and prospective cohorts; minimum 12 weeks; 2005 to present | Specify inclusion and exclusion criteria with equal precision; avoid post-hoc additions; state language and date policy explicitly |
| 3b–c. Search | Information sources and search strategy | MEDLINE, Embase, CENTRAL, CINAHL, Web of Science; ClinicalTrials.gov, WHO ICTRP, ADA/EASD/ASN conference abstracts; backward and forward citation chasing; PRESS peer review of strategy | Minimum 3 databases; grey literature is mandatory to reduce publication bias; provide full strategy for at least one database verbatim; rerun search before submission |
| 3d–f. Selection, extraction, RoB | Study selection, data collection, risk of bias | Dual independent screening in Covidence; Cohen’s Kappa reported; standardized extraction form piloted on 3 studies; RoB 2.0 for RCTs, ROBINS-I for cohorts; traffic-light plots via robvis | Two independent reviewers are non-negotiable; pre-specify arbitration process; pilot extraction form; do not exclude studies solely on RoB; report domain-level judgments |
| 3g–4. Synthesis and reporting | Data synthesis, subgroups, sensitivity, GRADE, dissemination | Narrative synthesis (SWiM); random-effects meta-analysis if 2 or more homogeneous studies; I² thresholds defined; 5 pre-specified subgroups; 4 sensitivity analyses; funnel plots if 10 or more studies; GRADE SoF table; PRISMA 2020 reporting; open data deposit via OSF or Zenodo | Justify pooling decision prospectively; pre-specify all subgroup and sensitivity analyses to avoid data dredging; GRADE is expected by most journals; share data and code on publication |
Conducting the Literature Search
The literature search is the engine of a systematic review. Its goal is to identify every study that might meet the inclusion criteria, regardless of where it was published, in what language, or whether its findings were positive or negative. A comprehensive, reproducible, and well-documented search is fundamental to the validity of the review’s conclusions.
Selecting Databases
No single database covers all relevant literature. At least three major bibliographic databases should be searched; more are typically needed for a complete review. Recommended databases by field include:
| Database | Primary Coverage | Relevant Fields |
| MEDLINE / PubMed | Biomedical and life sciences | Medicine, nursing, pharmacy, public health |
| Embase | Biomedical, pharmacological | Medicine, drug research, clinical trials |
| CINAHL | Nursing and allied health | Nursing, physiotherapy, occupational therapy |
| PsycINFO | Psychology and behavioral sciences | Mental health, cognitive science, education |
| Cochrane CENTRAL | Controlled trials | All clinical intervention research |
| Web of Science | Multidisciplinary | Science, social science, arts and humanities |
| Scopus | Multidisciplinary | Science, technology, social science |
| ERIC | Education | Pedagogy, educational policy, learning science |
Designing the Search Strategy
The search strategy translates the PICO question into database-searchable syntax. Key principles include:
- Use controlled vocabulary: Subject headings (MeSH in MEDLINE, EMTREE in Embase) capture concepts regardless of exact wording. Always combine subject headings with free-text keywords.
- Apply Boolean operators: Use AND to combine different concepts (e.g., population AND intervention), and OR to combine synonyms and related terms within a concept.
- Use truncation and wildcards: These retrieve multiple word endings from a root (e.g., “therap*” retrieves therapy, therapist, therapeutic).
- Avoid over-restriction: Do not apply date limits, language limits, or study design filters unless specifically justified; these can systematically exclude relevant evidence.
- Document every component: Record the exact search string run in each database, including the date of the search, to allow replication.
- Seek peer review of the search: The Peer Review of Electronic Search Strategies (PRESS) checklist provides a framework for a second librarian or information specialist to review the strategy before it is run.
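The Boolean principle above (OR within a concept, AND between concepts) can be sketched in a few lines of Python. This is only an illustration: the function names and the concept terms are hypothetical, and real strategies also need database-specific subject headings and field tags, which differ between MEDLINE, Embase, and other platforms.

```python
def build_block(terms):
    """Join the synonyms for ONE concept with OR, quoting multi-word phrases."""
    quoted = [f'"{t}"' if " " in t else t for t in terms]
    return "(" + " OR ".join(quoted) + ")"


def build_search(blocks):
    """Join the concept blocks with AND, in the order they were defined."""
    return " AND ".join(build_block(terms) for terms in blocks.values())


# Hypothetical concept blocks, loosely based on the low back pain example.
concepts = {
    "population": ["low back pain", "lumbago", "backache*"],
    "intervention": ["mindfulness", "MBSR", "mindfulness-based stress reduction"],
}

query = build_search(concepts)
print(query)
```

Keeping each concept in its own block makes it easier for a second person to trace the logic and to spot an AND that has crept in between synonyms.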
Searching Grey Literature and Supplementary Sources
Published, peer-reviewed studies represent only a portion of the available evidence. Failure to include grey literature can artificially inflate effect size estimates due to publication bias. Grey literature sources include:
- Trial and study registries: ClinicalTrials.gov, WHO International Clinical Trials Registry Platform, EU Clinical Trials Register
- Government and regulatory agency websites (e.g., CDC, FDA, EMA, NICE)
- Conference proceedings and abstract books
- Dissertations and theses (ProQuest Dissertations and Theses, EThOS)
- Preprint servers (medRxiv, bioRxiv, PsyArXiv) for recent, un-peer-reviewed work
- Google Scholar for supplemental discovery (not as a primary database due to lack of reproducible search syntax)
- Reference lists of included studies (backward citation chasing)
- Articles citing key included studies (forward citation chasing)
- Contact with study authors or key experts to identify unpublished or ongoing work
Common Errors in Literature Search
| Error | Why it matters | Example | How to avoid it |
| Searching too few databases | Each database has unique coverage; relying on one source misses a significant proportion of relevant studies | Searching only PubMed for a nursing intervention review, missing studies indexed exclusively in CINAHL | Search a minimum of three major databases plus at least one grey literature source; justify every database included and excluded in the methods section |
| Poorly constructed Boolean logic | Incorrect use of AND/OR inverts the intended search, either inflating results uncontrollably or excluding entire concept blocks | Using AND between synonyms for the same concept (e.g., “liraglutide AND semaglutide”) instead of OR, effectively limiting results to studies mentioning both drugs | Map each PICOS concept to a separate block; combine synonyms within a block using OR; combine blocks using AND; have a second person trace the logic before running |
| Missing controlled vocabulary | Free-text searching alone misses records indexed under a subject heading but not using your exact keywords in the title or abstract | Searching “heart attack” as free text but omitting the MeSH term “Myocardial Infarction” | Combine MeSH (MEDLINE), EMTREE (Embase), and CINAHL Subject Headings with free-text synonyms for every major concept; check the database thesaurus before finalizing the strategy |
| Inadequate synonym coverage | Concepts are expressed with multiple spellings, abbreviations, and regional variants; missing these creates gaps in retrieval | Searching “type 2 diabetes” but omitting “T2DM,” “NIDDM,” “non-insulin-dependent diabetes,” and “adult-onset diabetes” | Use truncation and wildcards (e.g., diabet*); consult existing systematic reviews on the topic to harvest synonyms; run the strategy past a subject librarian |
| Applying overly restrictive filters | Date limits, language filters, and study-design filters applied at the search stage silently exclude relevant records before a human ever screens them | Adding “English only” and “2015 to present” filters to the database search, missing older landmark studies and non-English trials | Apply filters at the screening stage rather than the search stage; if restrictions are unavoidable, justify them explicitly in the protocol and report them as a limitation |
| Neglecting grey literature | Unpublished or non-commercially published studies skew disproportionately toward null or negative results; omitting them inflates apparent effect sizes | Searching only peer-reviewed databases and missing a large government-funded trial reported only on ClinicalTrials.gov and in a conference abstract | Search trial registries (ClinicalTrials.gov, WHO ICTRP), regulatory databases, conference abstracts, and dissertations; contact study authors for unpublished data |
| Failing to peer-review the search strategy | Audits of published systematic reviews have found errors in a large majority of search strategies (reported rates exceed 70%); logical, spelling, and truncation errors go undetected without independent review | A truncation symbol entered incorrectly (e.g., “diabet:” instead of “diabet*” in PubMed) can silently retrieve few or no relevant results for an entire concept block without triggering an error message | Apply the PRESS 2015 checklist; have a second independent librarian or information specialist review every strategy before it is run; document the peer review as a supplementary file |
Study Selection
Study selection is the process of applying your pre-specified eligibility criteria to the results of the literature search to arrive at the set of studies to be included in the review. It is conducted in two sequential stages.
Stage 1: Title and Abstract Screening
All records retrieved from the search are screened at the title and abstract level by at least two independent reviewers. Records are classified as include, exclude, or uncertain. Any record that cannot be confidently excluded at this stage should be retained for full-text review. Disagreements between reviewers should be resolved through discussion, with escalation to a third reviewer if consensus cannot be reached.
Stage 2: Full-Text Eligibility Assessment
Records retained from Stage 1 are retrieved in full text and assessed against the complete eligibility criteria. Reasons for exclusion at this stage must be recorded for each excluded study and reported in the PRISMA flow diagram. Records should be managed using dedicated software:
- Covidence: The most widely used tool for systematic review management; supports dual screening, full-text review, and data extraction.
- Rayyan: A free web-based tool suitable for title and abstract screening with a blinding feature.
- EPPI-Reviewer: Supports complex reviews and machine-learning-assisted screening.
- EndNote / Zotero: Reference management tools often used for deduplication before screening begins.
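To show what deduplication before screening involves, here is a minimal Python sketch. It is not how any of the tools above work internally; the record fields (`title`, `year`, `doi`, `source`), the DOIs, and the matching rule (DOI if present, otherwise normalized title plus year) are assumptions chosen for illustration.

```python
import re


def normalize_title(title):
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    cleaned = re.sub(r"[^a-z0-9 ]", " ", title.lower())
    return " ".join(cleaned.split())


def dedup_key(record):
    """Prefer the DOI as the match key; fall back to normalized title + year."""
    doi = (record.get("doi") or "").strip().lower()
    if doi:
        return ("doi", doi)
    return ("title", normalize_title(record["title"]), record.get("year"))


def deduplicate(records):
    """Split records into first occurrences and later duplicates."""
    seen = set()
    unique, duplicates = [], []
    for rec in records:
        key = dedup_key(rec)
        if key in seen:
            duplicates.append(rec)
        else:
            seen.add(key)
            unique.append(rec)
    return unique, duplicates


# Made-up records exported from several databases (DOIs are fictitious).
records = [
    {"title": "Mindfulness for Low Back Pain: A Trial", "year": 2019, "doi": "10.1000/ABC123", "source": "MEDLINE"},
    {"title": "Mindfulness for low back pain: a trial.", "year": 2019, "doi": "10.1000/abc123", "source": "Embase"},
    {"title": "Yoga versus usual care", "year": 2020, "doi": None, "source": "CINAHL"},
    {"title": "Yoga  versus usual care.", "year": 2020, "doi": "", "source": "Scopus"},
    {"title": "Physiotherapy outcomes at 12 months", "year": 2021, "doi": None, "source": "MEDLINE"},
]

unique, duplicates = deduplicate(records)
print(len(unique), "unique;", len(duplicates), "duplicates removed")
```

A rule this simple will miss some duplicates (for example, the same article exported once with a DOI and once without), so the number removed should still be checked manually and reported in the PRISMA flow diagram.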
Measuring Agreement Between Reviewers
Inter-rater reliability should be calculated and reported at each screening stage, most commonly using Cohen’s Kappa coefficient. Kappa values are interpreted as:
| Kappa Value | Level of Agreement |
| Less than 0.00 | Poor agreement (worse than chance) |
| 0.00 to 0.20 | Slight agreement |
| 0.21 to 0.40 | Fair agreement |
| 0.41 to 0.60 | Moderate agreement |
| 0.61 to 0.80 | Substantial agreement |
| 0.81 to 1.00 | Almost perfect agreement |
A Kappa of 0.61 or above is generally considered acceptable for systematic review purposes, though the threshold depends on the consequences of misclassification for the specific review.
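As a worked illustration, Cohen’s Kappa compares the observed proportion of agreement with the agreement expected by chance: kappa = (p_o − p_e) / (1 − p_e). The sketch below computes it for two reviewers’ screening decisions. The decision lists are invented for the example, and in practice screening software or a statistics package would normally report this figure.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's Kappa for two raters who labelled the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("Both raters must label the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: share of items given the same label by both raters.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's marginal proportions per label.
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label for every item
    return (observed - expected) / (1 - expected)


# Invented title/abstract decisions for ten records.
reviewer_1 = ["include", "include", "exclude", "exclude", "exclude",
              "exclude", "include", "exclude", "exclude", "exclude"]
reviewer_2 = ["include", "exclude", "exclude", "exclude", "exclude",
              "exclude", "include", "exclude", "include", "exclude"]

kappa = cohens_kappa(reviewer_1, reviewer_2)
print(f"kappa = {kappa:.2f}")  # about 0.52 for this invented data
```

Here the reviewers agree on 8 of 10 records (80%), yet Kappa is only about 0.52 (moderate agreement) because a large share of that agreement would be expected by chance when most records are excluded.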
The PRISMA Flow Diagram
The PRISMA flow diagram provides a visual audit trail of the study selection process. It must be included in the final report and should document:
- Total records identified from each database and additional source
- Number of duplicates removed
- Number screened at title and abstract stage and number excluded
- Number assessed for full-text eligibility and number excluded with reasons
- Number of studies included in the final review
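The counts listed above should reconcile arithmetically from one stage to the next. The following Python sketch checks that; the function name, the source names, and all numbers are hypothetical. It is also simplified relative to the full PRISMA 2020 template, which additionally tracks reports sought for retrieval and not retrieved.

```python
def prisma_counts(identified_by_source, duplicates_removed,
                  excluded_title_abstract, fulltext_exclusions):
    """Derive the stage totals of a simplified PRISMA flow from raw counts."""
    identified = sum(identified_by_source.values())
    screened = identified - duplicates_removed
    fulltext_assessed = screened - excluded_title_abstract
    included = fulltext_assessed - sum(fulltext_exclusions.values())
    if min(screened, fulltext_assessed, included) < 0:
        raise ValueError("Counts are inconsistent: a stage has a negative total")
    return {
        "identified": identified,
        "screened": screened,
        "fulltext_assessed": fulltext_assessed,
        "included": included,
    }


# Invented numbers for illustration only.
flow = prisma_counts(
    identified_by_source={"MEDLINE": 412, "Embase": 530, "CINAHL": 118, "Registries": 40},
    duplicates_removed=300,
    excluded_title_abstract=720,
    fulltext_exclusions={"wrong population": 25, "wrong design": 30, "wrong outcome": 10},
)

for stage, count in flow.items():
    print(f"{stage}: {count}")
```

Running a check like this before drawing the diagram catches the common problem of boxes that do not add up.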
Sample PRISMA Flow Diagram
Data Extraction
Data extraction is the systematic retrieval of pre-specified information from each included study. The goal is to capture all data needed for synthesis, quality appraisal, and reporting in a standardized, reproducible manner.
Designing the Data Extraction Form
A data extraction form (also called a data collection form or charting form) should be piloted on two to three included studies before being used in full. The form should be developed by the review team collaboratively and should capture the following categories of information:
| Category | Example Data Items |
| Bibliographic | Authors, year, journal, country of publication, funding source |
| Study Design | Design type, randomization method, blinding, allocation concealment |
| Population | Sample size, age, sex/gender, diagnosis or condition, inclusion and exclusion criteria |
| Intervention / Exposure | Type, dose, duration, frequency, delivery setting, comparator details |
| Outcomes | Outcome measures, measurement tools, time points, follow-up duration |
| Results | Effect sizes, confidence intervals, p-values, sample sizes per group, attrition |
| Quality / Bias | Risk of bias domain ratings (from formal appraisal tool) |
Data extraction should be performed independently by two reviewers, with discrepancies resolved through discussion or referral to a third reviewer. Where data are ambiguous, missing, or inconsistently reported, authors of the original studies should be contacted for clarification. All decisions should be documented.
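One way to support the reconciliation step is to compare the two reviewers’ completed forms field by field and list only the items that differ. The sketch below assumes each form is stored as a simple dictionary; the field names, the study identifier, and the values are invented for illustration.

```python
def find_discrepancies(extraction_a, extraction_b, numeric_tolerance=0.0):
    """Return {field: (value_a, value_b)} for fields where two extractions differ."""
    discrepancies = {}
    for field in sorted(set(extraction_a) | set(extraction_b)):
        a = extraction_a.get(field)
        b = extraction_b.get(field)
        both_numeric = isinstance(a, (int, float)) and isinstance(b, (int, float))
        if both_numeric:
            # Allow a small tolerance for rounding differences in numeric items.
            differs = abs(a - b) > numeric_tolerance
        else:
            # Covers text fields and items one reviewer left missing (None).
            differs = a != b
        if differs:
            discrepancies[field] = (a, b)
    return discrepancies


# Invented extractions of the same study by two reviewers.
reviewer_1 = {"study_id": "Smith2019", "design": "RCT", "sample_size": 120,
              "mean_age": 54.2, "followup_months": 12}
reviewer_2 = {"study_id": "Smith2019", "design": "RCT", "sample_size": 102,
              "mean_age": 54.2, "followup_months": None}

conflicts = find_discrepancies(reviewer_1, reviewer_2)
for field, (value_1, value_2) in conflicts.items():
    print(f"{field}: reviewer 1 = {value_1!r}, reviewer 2 = {value_2!r}")
```

The output is a worklist for discussion, not a resolution: each flagged item still has to be checked against the original paper and the decision documented.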
Common Difficulties in Data Extraction & Next Steps
| Difficulty | Why it occurs | How to resolve it |
| Missing or unreported data | Authors omit standard deviations, denominators, or subgroup breakdowns, particularly in older publications | Contact the corresponding author with a specific data request; use the standard error or confidence interval to back-calculate SD; note the gap transparently and assess its impact on synthesis |
| Inconsistent outcome definitions across studies | Different studies measure the same construct with different tools, time points, or thresholds (e.g., hypoglycemia defined as plasma glucose below 3.0 vs. 3.9 mmol/L) | Pre-specify how definitional variants will be handled in the protocol; group studies by outcome definition in the narrative synthesis; conduct sensitivity analyses restricted to studies using the same definition |
| Multiple publications from the same study | A single trial generates a primary paper, secondary analyses, and long-term follow-up reports, inflating apparent study counts and risking double-counting of participants | Link all reports to a single study record during deduplication; extract data from the most complete report and supplement with secondary papers; document all linked publications in the included-studies table |
| Discrepancies between two independent extractors | Reviewers interpret ambiguous text, tables, or figures differently, producing conflicting values for the same data item | Resolve through structured discussion referencing the original text; escalate to a pre-specified third reviewer if consensus is not reached; calculate and report inter-rater agreement after a pilot extraction exercise |
| Data presented only in figures or graphs | Numerical values are embedded in bar charts, Kaplan-Meier curves, or forest plots without accompanying tables | Use validated digitizing software (e.g., WebPlotDigitizer) to extract approximate values; document the extraction method and its inherent imprecision; contact authors for the underlying data |
| Unit or scale inconsistencies | Studies report the same outcome in different units (e.g., eGFR in mL/min vs. mL/min/1.73 m², HbA1c in % vs. mmol/mol) or use different Likert scale directions | Convert all values to a common unit before pooling using validated conversion formulas; document every conversion in the extraction form; flag studies where conversion introduces meaningful imprecision |
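The back-calculation of a standard deviation mentioned in the first row of the table can be sketched as follows. This is a minimal example for a single group mean, assuming a 95% confidence interval built from the normal distribution (z = 1.96). For small samples a t-distribution value would be more appropriate, and the function names are illustrative.

```python
import math


def sd_from_se(se: float, n: int) -> float:
    """SD of one group from the standard error of its mean: SD = SE * sqrt(n)."""
    if n <= 0 or se < 0:
        raise ValueError("n must be positive and se non-negative")
    return se * math.sqrt(n)


def sd_from_ci(lower: float, upper: float, n: int, z: float = 1.96) -> float:
    """SD of one group from a confidence interval around its mean.

    Assumes a symmetric interval of the form mean +/- z * SE, so that
    SE = (upper - lower) / (2 * z). The default z = 1.96 corresponds to
    a 95% interval under a normal approximation (an assumption).
    """
    if upper < lower:
        raise ValueError("upper limit must not be below lower limit")
    se = (upper - lower) / (2 * z)
    return sd_from_se(se, n)


# Hypothetical group: n = 64, mean reported with SE = 0.5
print(sd_from_se(0.5, 64))
# Same hypothetical group reported as a 95% CI of 9.02 to 10.98
print(round(sd_from_ci(9.02, 10.98, 64), 3))
```

Both calls describe the same hypothetical group, so they should give approximately the same SD. Any value recovered this way should be flagged as derived in the extraction form.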
Critical Appraisal and Quality Assessment
Every included study must be assessed for methodological quality and risk of bias. Critical appraisal is not a gate-keeping mechanism to exclude imperfect studies; rather, it provides the information needed to weight evidence appropriately and identify sources of variability in the review findings.
What Tools Should Be Used for Critical Appraisal?
Tool selection should be matched to the study design and pre-specified in the protocol. The most widely used tools are:
| Tool | Study Design | Key Domains Assessed |
| Cochrane RoB 2.0 | Randomized controlled trials | Randomization, deviations from intervention, missing outcome data, measurement, selective reporting |
| ROBINS-I | Non-randomized studies of interventions | Confounding, selection bias, classification of interventions, missing data, outcome measurement, selective reporting |
| QUADAS-2 | Diagnostic accuracy studies | Patient selection, index test, reference standard, flow and timing |
| CASP Qualitative Checklist | Qualitative studies | Research design, sampling, data collection, reflexivity, ethical issues, rigor |
| Newcastle-Ottawa Scale | Cohort and case-control studies | Selection, comparability, exposure or outcome assessment |
| AMSTAR-2 | Systematic reviews (for umbrella reviews) | Protocol, search, study selection, data extraction, risk of bias, synthesis |
How to Handle Studies with High Risk of Bias
High risk of bias does not automatically disqualify a study from inclusion. The appropriate response depends on the overall evidence base:
- Include the study but note the limitations in the narrative synthesis
- Conduct a sensitivity analysis excluding high-risk studies to assess their influence on pooled estimates
- Use the Grading of Recommendations Assessment, Development and Evaluation (GRADE) framework to formally downgrade the overall certainty of evidence when risk of bias is a concern across included studies
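The sensitivity analysis in the second bullet can be illustrated with a simple inverse-variance (fixed-effect) pooled estimate computed twice, once with all studies and once without those rated high risk. The study data and labels below are hypothetical, and a real analysis would normally use dedicated meta-analysis software.

```python
import math

# Hypothetical studies: (label, effect estimate, standard error, overall risk of bias)
studies = [
    ("Study A", -0.30, 0.12, "low"),
    ("Study B", -0.45, 0.20, "high"),
    ("Study C", -0.10, 0.15, "some concerns"),
    ("Study D", -0.60, 0.25, "high"),
]


def pool_fixed(rows):
    """Inverse-variance fixed-effect pooled estimate and its standard error."""
    if not rows:
        raise ValueError("no studies to pool")
    weights = [1.0 / se ** 2 for _, _, se, _ in rows]
    total = sum(weights)
    estimate = sum(w * effect for w, (_, effect, _, _) in zip(weights, rows)) / total
    return estimate, math.sqrt(1.0 / total)


def describe(name, rows):
    """Print the pooled estimate with a 95% confidence interval."""
    est, se = pool_fixed(rows)
    print(f"{name}: {est:.3f} (95% CI {est - 1.96 * se:.3f} to {est + 1.96 * se:.3f})")


describe("All studies", studies)
describe("Excluding high risk of bias", [s for s in studies if s[3] != "high"])
```

If the two estimates lead to the same conclusion, the pooled result is robust to the high-risk studies. If they diverge, that difference belongs in the narrative synthesis and the GRADE assessment.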
Synthesizing Results
Synthesis brings together the data extracted from individual studies to produce conclusions about the body of evidence as a whole. Every systematic review must include at least a narrative synthesis; quantitative synthesis (meta-analysis) is an additional option when studies are sufficiently similar.
Narrative Synthesis
Narrative synthesis uses structured text and tables, rather than statistical pooling, to summarize and integrate findings. It is appropriate when:
- Studies are too heterogeneous in design, population, or outcome to be pooled
- Outcomes are qualitative or not amenable to statistical combination
- The number of included studies is too small for robust pooling
A rigorous narrative synthesis should not be a simple list of individual study results. It should actively compare and contrast studies, identify patterns and divergences, consider how context, population, or design differences may explain variation, and draw evidence-based conclusions about the overall state of knowledge.
When Is Meta-Analysis Appropriate?
Meta-analysis is appropriate when included studies are clinically and methodologically similar enough that pooling their results is a meaningful scientific act rather than a statistical convenience. Before conducting a meta-analysis, address the following:
- Clinical homogeneity: Are the populations, interventions, comparators, and outcomes sufficiently similar?
- Statistical heterogeneity: Use the I² statistic (values above 50% are conventionally read as indicating substantial heterogeneity) and Cochran’s Q test (a chi-squared test) to assess variability in effects between studies.
- Model selection: A random-effects model is generally preferred when studies are drawn from different settings; a fixed-effect model assumes all studies estimate one true effect.
- Subgroup analyses: Pre-specify any planned subgroup analyses (e.g., by age group, dose, or risk of bias) to explore sources of heterogeneity.
- Sensitivity analyses: Assess the robustness of pooled estimates by repeating analyses with high-risk studies excluded, or using alternative statistical models.
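To make the heterogeneity statistics and the model choice above concrete, the sketch below computes Cochran's Q, I², and a between-study variance (tau²), then a random-effects pooled estimate. It assumes the DerSimonian-Laird estimator for tau², which is one common choice among several. The function names are illustrative, and the inputs are effect estimates with their standard errors on a scale suitable for pooling (for example, log odds ratios).

```python
import math


def heterogeneity(effects, std_errors):
    """Return Cochran's Q, its degrees of freedom, I² (in %), and DerSimonian-Laird tau²."""
    if len(effects) != len(std_errors) or len(effects) < 2:
        raise ValueError("need matching lists with at least two studies")
    w = [1.0 / se ** 2 for se in std_errors]             # fixed-effect weights
    pooled = sum(wi * yi for wi, yi in zip(w, effects)) / sum(w)
    q = sum(wi * (yi - pooled) ** 2 for wi, yi in zip(w, effects))
    df = len(effects) - 1
    i2 = max(0.0, (q - df) / q) * 100.0 if q > 0 else 0.0
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - df) / c) if c > 0 else 0.0
    return {"Q": q, "df": df, "I2": i2, "tau2": tau2}


def pool_random_effects(effects, std_errors, tau2):
    """Random-effects pooled estimate and standard error for a given tau²."""
    w = [1.0 / (se ** 2 + tau2) for se in std_errors]    # random-effects weights
    estimate = sum(wi * yi for wi, yi in zip(w, effects)) / sum(w)
    return estimate, math.sqrt(1.0 / sum(w))


# Hypothetical effect estimates and standard errors
effects = [-0.30, -0.45, -0.10, -0.60]
std_errors = [0.12, 0.20, 0.15, 0.25]

stats = heterogeneity(effects, std_errors)
estimate, se = pool_random_effects(effects, std_errors, stats["tau2"])
print(f"Q = {stats['Q']:.2f} on {stats['df']} df, I² = {stats['I2']:.1f}%")
print(f"Random-effects estimate = {estimate:.3f} (SE {se:.3f})")
```

When tau² is estimated as zero, the random-effects weights reduce to the fixed-effect weights and the two models give the same pooled estimate.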
Assessing Publication Bias
Publication bias occurs when studies with positive or statistically significant results are more likely to be published than those with null findings, leading to an overestimation of effects in the published literature. Methods for assessing publication bias include:
- Funnel plots: A scatter plot of effect size against a measure of study precision; asymmetry in the funnel suggests possible publication bias. Funnel plots require at least 10 studies for meaningful interpretation.
- Egger’s test: A statistical test for funnel plot asymmetry.
- Trim and fill method: An approach that estimates the number of missing studies and adjusts the pooled estimate accordingly.
- Comprehensive grey literature searching: The most proactive safeguard against publication bias is a thorough search that includes unpublished and grey literature from the outset.
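Egger's test, listed above, is commonly implemented as a linear regression of each study's standardized effect (effect divided by its standard error) on its precision (one divided by the standard error), with an intercept far from zero suggesting funnel plot asymmetry. The sketch below follows that formulation using ordinary least squares. The function name is illustrative, and it returns the t statistic and degrees of freedom rather than a p-value, which would come from a t distribution in a statistics package.

```python
import math


def egger_regression(effects, std_errors):
    """Return (intercept, t_statistic, degrees_of_freedom) for Egger's regression."""
    n = len(effects)
    if n != len(std_errors) or n < 3:
        raise ValueError("need matching lists with at least three studies")
    x = [1.0 / se for se in std_errors]                   # precision
    z = [y / se for y, se in zip(effects, std_errors)]    # standardized effect
    x_bar = sum(x) / n
    z_bar = sum(z) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("all standard errors are equal; the regression is undefined")
    sxz = sum((xi - x_bar) * (zi - z_bar) for xi, zi in zip(x, z))
    slope = sxz / sxx
    intercept = z_bar - slope * x_bar
    # Residual variance and the standard error of the intercept (simple OLS formulas)
    sse = sum((zi - (intercept + slope * xi)) ** 2 for xi, zi in zip(x, z))
    residual_var = sse / (n - 2)
    se_intercept = math.sqrt(residual_var * (1.0 / n + x_bar ** 2 / sxx))
    t_stat = intercept / se_intercept if se_intercept > 0 else math.inf
    return intercept, t_stat, n - 2


# Hypothetical effect estimates and standard errors from ten studies
effects = [0.42, 0.31, 0.55, 0.20, 0.61, 0.18, 0.47, 0.25, 0.70, 0.15]
std_errors = [0.20, 0.12, 0.28, 0.08, 0.30, 0.07, 0.22, 0.10, 0.35, 0.06]

intercept, t_stat, df = egger_regression(effects, std_errors)
print(f"Egger intercept = {intercept:.3f}, t = {t_stat:.2f} on {df} df")
```

As with the funnel plot itself, the test has little power with few studies, so a non-significant result is weak evidence that publication bias is absent.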
Assessing the Overall Certainty of Evidence Using GRADE
The GRADE (Grading of Recommendations Assessment, Development and Evaluation) framework provides a systematic approach for rating the overall certainty of a body of evidence for each outcome. Evidence begins at a rating based on study design and may be downgraded or upgraded:
| GRADE Rating | Starting Point | Can Be Downgraded For |
| High | Randomized controlled trials | Risk of bias, inconsistency, indirectness, imprecision, publication bias |
| Moderate | Downgraded RCTs or upgraded observational studies | Same five factors |
| Low | Observational studies | Same five factors |
| Very Low | Downgraded observational studies or case series | Same five factors |
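The bookkeeping behind the table can be expressed as a small function: start at High for randomized trials or Low for observational studies, move down one level per downgrade and up one per upgrade, and stay within the four levels. This is only a sketch of the arithmetic with illustrative names. In practice each downgrade or upgrade is a documented judgment made per outcome, not a mechanical count.

```python
LEVELS = ["Very Low", "Low", "Moderate", "High"]


def grade_certainty(design: str, downgrades: int = 0, upgrades: int = 0) -> str:
    """Return the GRADE certainty level after applying downgrades and upgrades.

    design: "rct" starts at High; "observational" starts at Low.
    downgrades/upgrades: number of levels moved down or up (non-negative).
    """
    if design not in ("rct", "observational"):
        raise ValueError("design must be 'rct' or 'observational'")
    if downgrades < 0 or upgrades < 0:
        raise ValueError("downgrades and upgrades must be non-negative")
    start = 3 if design == "rct" else 1           # index into LEVELS
    index = max(0, min(len(LEVELS) - 1, start - downgrades + upgrades))
    return LEVELS[index]


# RCT evidence downgraded one level for risk of bias and one for imprecision
print(grade_certainty("rct", downgrades=2))            # prints: Low
# Observational evidence upgraded one level (for example, for a large effect)
print(grade_certainty("observational", upgrades=1))    # prints: Moderate
```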
Reporting the Systematic Review
Clear, transparent reporting is essential so that readers can assess the validity of the review, replicate the methods, and apply the findings appropriately. The primary reporting standard for systematic reviews is the PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) statement.
The PRISMA Statement
The PRISMA 2020 update comprises a 27-item checklist organized around the following sections of a systematic review report:
- Title: Identify the document as a systematic review (and meta-analysis where applicable).
- Abstract: Provide a structured summary covering background, objectives, data sources, eligibility criteria, risk of bias assessment, synthesis methods, results, limitations, conclusions, and registration details.
- Introduction: Describe the rationale for the review and explicitly state the research question.
- Methods: Detail eligibility criteria, information sources, search strategy (including the full search string for at least one database), selection process, data extraction process, risk of bias assessment methods, and synthesis methods.
- Results: Present the PRISMA flow diagram, characteristics of included studies, risk of bias assessment results, and synthesized findings.
- Discussion: Interpret findings in context, discuss limitations, and present conclusions.
- Registration: Provide the protocol registration number and any deviations from the registered protocol.
PRISMA extensions are available for specific review types, including PRISMA-P (protocols), PRISMA-NMA (network meta-analysis), PRISMA-IPD (individual participant data), PRISMA-DTA (diagnostic test accuracy), and PRISMA-ScR (scoping reviews).
Key Sections of the Methods Chapter
The methods section is the most critical component for reproducibility. It should allow an independent team to replicate the review exactly. Checklist for a complete methods section:
- Eligibility criteria stated explicitly (population, intervention, comparator, outcomes, study design, language, date range if applicable)
- Complete search strategy for at least one database presented verbatim
- All databases and supplementary sources listed, with dates searched
- Screening process described (number of reviewers, software, handling of disagreements)
- Data extraction process described (form used, number of reviewers, piloting)
- Risk of bias tool named and process for assessment described
- Synthesis approach justified (narrative synthesis or meta-analysis, with statistical methods specified)
- Any planned subgroup analyses, sensitivity analyses, or assessment of publication bias described
- GRADE assessment planned or reasons for exclusion of GRADE stated
Dissemination and Publication
Completing a systematic review without disseminating the findings wastes the effort invested and denies the research community access to a potentially important evidence synthesis. Dissemination strategies should be considered from the outset.
Choosing a Target Journal
Selection of a target journal should be guided by:
- The subject area and audience most likely to use the findings (clinicians, policymakers, researchers)
- Whether the journal accepts systematic reviews and has published similar work previously
- Impact factor and indexing in major bibliographic databases
- Open-access requirements of funders and institutions
- Adherence to PRISMA as a condition of submission
Journals that specialize in systematic reviews include Systematic Reviews (BioMed Central), the Cochrane Database of Systematic Reviews, and the Campbell Collaboration library. Many field-specific journals also publish high-quality systematic reviews.
Handling Peer Review
Peer reviewers of systematic reviews commonly focus on:
- Adequacy and comprehensiveness of the search strategy
- Consistency of inclusion criteria application
- Appropriateness of risk of bias tool selection and application
- Justification of the synthesis approach
- Transparency of deviation from the registered protocol
Responses to reviewers should address each comment systematically, referencing specific changes made in the manuscript with page and line numbers.
Common Journal Peer Reviewer Concerns for Systematic Reviews
| Reviewer concern | Underlying reason | How to address it |
| Search strategy is incomplete or poorly documented | Reviewers cannot assess reproducibility if databases, date ranges, or the full search string are absent or vague | Provide the verbatim search string for every database searched as a supplementary file; list all databases, grey literature sources, and date ranges; include the PRESS peer review form |
| No protocol registration or late registration | Absence of a pre-registered protocol raises suspicion of outcome switching or post-hoc methodological decisions | Register in PROSPERO before screening begins; cite the registration number in the abstract and methods; explain and justify any deviations from the registered protocol transparently |
| Single reviewer used for screening or extraction | Using one reviewer introduces subjective bias that undermines the core methodological advantage of a systematic review | Report that two independent reviewers conducted all screening and extraction stages; provide Cohen’s Kappa at each stage; describe the arbitration process for disagreements |
| Risk of bias assessment is superficial or tool is mismatched | Applying the wrong tool (e.g., Newcastle-Ottawa for RCTs) or reporting only overall judgments without domain-level reasoning weakens appraisal credibility | Select and justify the tool for each study design (RoB 2.0 for RCTs, ROBINS-I for observational studies); report domain-level judgments for every included study; present traffic-light plots |
| Meta-analysis conducted despite substantial heterogeneity | Pooling clinically or statistically heterogeneous studies produces a misleading average that may obscure more than it reveals | Report I² with confidence intervals and Cochran’s Q test; justify the decision to pool or not pool; explore heterogeneity through pre-specified subgroup and sensitivity analyses; consider narrative synthesis if I² exceeds 75% |
| PRISMA flow diagram or checklist is incomplete | Missing counts, absent exclusion reasons, or an unpopulated checklist prevent readers from auditing the selection process | Provide a fully populated PRISMA 2020 flow diagram with reasons for every full-text exclusion; submit the completed PRISMA checklist as a supplementary file; cross-reference each checklist item to the manuscript page and line where it is addressed |
| Conclusions overreach the strength of the evidence | Authors draw strong recommendations from low-certainty evidence or small numbers of studies without acknowledging limitations | Apply the GRADE framework and explicitly state the certainty rating for each outcome in a Summary of Findings table; calibrate conclusion language to the evidence grade; dedicate a paragraph to limitations including publication bias |
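Cohen's kappa, requested in the table above for dual screening and extraction, corrects the raw agreement between two reviewers for the agreement expected by chance. A minimal sketch follows, assuming each reviewer's decisions are stored as equal-length lists of category labels. The function name and the example decisions are hypothetical.

```python
from collections import Counter


def cohens_kappa(reviewer_1, reviewer_2):
    """Cohen's kappa for two raters' categorical decisions on the same records."""
    if len(reviewer_1) != len(reviewer_2) or not reviewer_1:
        raise ValueError("need two non-empty lists of equal length")
    n = len(reviewer_1)
    # Observed agreement: share of records where both reviewers chose the same label
    observed = sum(a == b for a, b in zip(reviewer_1, reviewer_2)) / n
    # Chance agreement: product of each reviewer's marginal proportions, summed over labels
    counts_1, counts_2 = Counter(reviewer_1), Counter(reviewer_2)
    labels = set(counts_1) | set(counts_2)
    expected = sum(counts_1[label] * counts_2[label] for label in labels) / n ** 2
    if expected == 1.0:
        return 1.0  # both reviewers used one identical label throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical title and abstract decisions for eight records
r1 = ["include", "exclude", "exclude", "include", "exclude", "exclude", "include", "exclude"]
r2 = ["include", "exclude", "include", "include", "exclude", "exclude", "exclude", "exclude"]
print(round(cohens_kappa(r1, r2), 3))
```

Reporting kappa separately for the title and abstract stage and the full-text stage lets readers see where reviewer disagreement was concentrated.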
Other Dissemination Channels
Beyond journal publication, consider:
- Conference presentations and posters to reach practitioners and researchers early
- Policy briefs or lay summaries for non-specialist audiences
- Preprint posting (e.g., medRxiv) to make findings available while under peer review
- Sharing data extraction files and screening decisions as supplementary materials or on open repositories (OSF, Zenodo, Figshare) to facilitate replication and updating
Software, Tools, and Emerging Technologies
Recommended Software by Task
| Task | Recommended Tools | Notes |
| Protocol registration | PROSPERO, OSF Registries | Free; PROSPERO is health-specific |
| Reference management and deduplication | EndNote, Zotero, Mendeley | Deduplication before importing into screening tools is essential |
| Screening and data extraction | Covidence, Rayyan, EPPI-Reviewer | Covidence integrates with Cochrane; Rayyan is free |
| Statistical analysis (meta-analysis) | RevMan 5, R (meta package), Stata (metan) | RevMan is free and used by Cochrane; R and Stata offer greater flexibility |
| Risk of bias assessment | RoB 2.0 web tool, RobotReviewer | RobotReviewer uses machine learning to assist RoB assessment |
| GRADE assessment | GRADEpro GDT | Free; produces Summary of Findings tables |
| PRISMA flow diagram | PRISMA2020 R package, Lucidchart, draw.io | PRISMA2020 package generates diagrams directly from screening counts |
The Role of Artificial Intelligence in Systematic Reviews
Artificial intelligence and machine learning tools are increasingly used to assist with systematic review tasks, particularly those that are repetitive and high-volume. Current evidence-supported applications include:
- Title and abstract screening assistance: Machine learning classifiers can prioritize records most likely to be relevant, reducing the number of records a human reviewer must screen (known as active learning or technology-assisted review). These tools include ASReview, EPPI-Reviewer’s AI plugin, and Rayyan’s machine learning feature.
- Risk of bias automation: Tools such as RobotReviewer use natural language processing to extract risk of bias signals from trial reports.
- Data extraction assistance: Large language models are being evaluated for structured data extraction, though human verification remains essential.
AI tools do not replace the need for human judgment and dual review. Any use of AI in the review process should be transparently described in the methods section, including the tool used, how it was applied, and what human oversight was maintained.
Common Pitfalls and How to Avoid Them
| Pitfall | Why It Matters | How to Avoid It |
| Poorly defined PICO question | Leads to unfocused searches, inconsistent eligibility decisions, and conclusions that cannot be applied | Spend time refining the question before any other step; pilot the criteria on sample records |
| Searching too few databases | Misses relevant studies, introducing selection bias | Search a minimum of three major databases plus grey literature sources |
| Errors in the search strategy | Published audits have reported errors in roughly 70-90% of systematic review search strategies | Use PRESS peer review; involve an experienced librarian in strategy design |
| Single reviewer screening or extraction | Introduces subjective bias into study selection and data capture | Mandate dual review at every stage; calculate and report inter-rater agreement |
| No protocol or late registration | Allows post-hoc outcome switching and reduces credibility | Register in PROSPERO before the search begins |
| Inappropriate meta-analysis | Pooling clinically heterogeneous studies produces a misleading average | Assess clinical and statistical homogeneity before pooling; use narrative synthesis if heterogeneity is high |
| Failing to assess publication bias | Review conclusions may overestimate beneficial effects | Search grey literature; produce funnel plots when 10 or more studies are available |
| Incomplete PRISMA reporting | Prevents replication and reduces credibility with journals and readers | Complete the PRISMA checklist before submission; share it as a supplementary document |
When and How to Update a Systematic Review
Evidence accumulates over time, and a systematic review that is current today may become outdated as new primary studies are published. Most systematic reviews should be considered for updating every two to five years, or sooner if:
- A significant new trial or body of evidence has been published in the field
- A new intervention, comparator, or outcome has become clinically relevant
- The existing review’s conclusions have been challenged by subsequent evidence
- Guidelines based on the review are being revised
The update process should follow the same rigorous methodology as the original review, with a new or updated protocol registered in PROSPERO. The update search is typically run from the date of the last search in the original review. Changes in methodology or scope between the original and updated review should be transparently documented.
Living systematic reviews represent a more resource-intensive model in which the review is continuously updated as new evidence becomes available, often using semi-automated screening and rapid publication cycles. This model is particularly relevant for fast-moving fields, such as emerging infectious diseases.
Frequently Asked Questions
Can I conduct a systematic review on my own?
Conducting a rigorous systematic review as a solo researcher is strongly discouraged and is inconsistent with most published standards. At minimum, a second independent reviewer is required for screening and data extraction to protect against subjective bias. That said, some institutions allow solo reviews for degree projects provided that methods are clearly documented, bias limitations are acknowledged, and an experienced supervisor or collaborator reviews key decisions.
Do I need PROSPERO registration even for a student project?
Registration is not legally mandated, but it is strongly recommended for any systematic review regardless of context. Journals increasingly require PROSPERO registration as a condition of submission. Even for a student project, registration demonstrates methodological rigor, prevents duplication of effort, and protects your work by establishing priority. Registration is free and typically takes less than an hour to complete.
How do I handle studies published in languages other than English?
Restricting a systematic review to English-language studies can introduce language bias, particularly in fields where important research is published in German, French, Spanish, or other languages. Best practice is to search without language restrictions, then translate or arrange translation for non-English studies that meet your eligibility criteria. If translation is not feasible, the language restriction must be transparently disclosed as a limitation.
What is the difference between a “reason for exclusion” and a “reason for inclusion”?
Inclusion is binary: a study either meets all eligibility criteria or it does not. Only one reason for exclusion needs to be recorded for each excluded full-text study (the primary or most decisive reason). A common error is recording multiple reasons per study; the convention is to apply criteria hierarchically and stop at the first unmet criterion. Reasons should be presented in aggregate in the PRISMA flow diagram, not listed individually for every excluded record.
How many studies do I need to include for a valid systematic review?
There is no minimum number of included studies required for a valid systematic review. A review that identifies zero studies satisfying the eligibility criteria is informative: it reveals a gap in the evidence base. Reviews with very few studies can still be valid, though conclusions must be drawn cautiously. Meta-analysis requires a minimum of two studies, and funnel plot interpretation is unreliable with fewer than ten. The validity of a review depends on the rigor of its methods, not its yield.
Should I use ChatGPT or other AI tools to help screen studies?
General-purpose large language models such as ChatGPT are not validated for systematic review screening and should not replace human dual review. They may introduce systematic errors or hallucinate eligibility decisions. Purpose-built tools such as ASReview, Rayyan’s machine learning feature, and Cochrane’s RCT classifier are designed and evaluated for these tasks. Any AI assistance in the screening process must be fully described in the methods section, and human verification of AI-assisted decisions is non-negotiable.
What is the difference between risk of bias and quality assessment?
“Risk of bias” is the preferred contemporary term for what was historically called “quality assessment.” Risk of bias refers specifically to systematic error in study results due to flaws in design, conduct, or reporting. The term “quality” is broader and may encompass reporting quality, external validity, or methodological rigor. Modern tools (RoB 2.0, ROBINS-I) assess risk of bias in specific domains and assign a judgment of low, some concerns, or high risk. A study can be well-reported but still have a high risk of bias, and vice versa.
How should I handle disagreements between co-reviewers that cannot be resolved by discussion?
Irresolvable disagreements between two reviewers should be escalated to a pre-specified third reviewer or arbitrator, who makes the final decision. This process should be described in the protocol and methods section before the review begins. In practice, persistent disagreements often signal ambiguity in the eligibility criteria themselves; revising and clarifying criteria during a calibration exercise before formal screening begins can substantially reduce downstream conflict.
What is the difference between a systematic review and a realist review?
A systematic review aggregates evidence to determine whether an intervention works, typically pooling quantitative data and ranking studies by design quality (RCTs preferred). A realist review asks why, for whom, and under what circumstances an intervention works, using theory-driven synthesis of mixed evidence. Systematic reviews control for context; realist reviews treat context as central. The output of a systematic review is usually an effect estimate; a realist review produces refined programme theory in the form of Context, Mechanism, Outcome (CMO) configurations.
