Diagnostic test studies: assessment and critical appraisal
There are many checklists available for the assessment and critical appraisal of diagnostic test studies, as reporting is frequently inadequate.[1][2] However, they all include some variation of three critical questions:[2][3]
- Is this study valid?
- Does the diagnostic test under assessment accurately distinguish between people who do and do not have the disorder?
- Can I apply this valid, accurate diagnostic test to a specific patient?
Assessment
How do we assess if a diagnostic test study is valid?
We can assess whether our study is valid by considering these questions:
1. Was there an independent, blind comparison with a reference (gold) standard of diagnosis? What does that mean?
- That patients in the study should have undergone both the index diagnostic test and the reference (gold) standard. Why? To confirm or refute the findings of the index test. The accuracy of a test can be overestimated if the index test is first performed in people known to have the disease and then separately in healthy people (as case-control studies do), rather than performing both the index and reference tests in the same group of people without knowing in advance who has the disease you are trying to diagnose.[4]
- That the people assessing the results of the index test are blind to the results of the reference standard. Why? To avoid biasing the results of the index test or the reference standard. Interpreting the results of the reference test while already knowing the results of the index test can lead to an overestimation of the index test accuracy, especially if the reference test is open to subjective interpretation.[4] Blinding is less important if the results of the test are objective (e.g., serodiagnostic tests for tuberculosis where sputum culture results are analyzed) than if results require clinical interpretation (e.g., MRI images for diagnosing rotator cuff injury).
2. Was the diagnostic test evaluated in an appropriate spectrum of patients (like those a clinician would see in practice)? What does that mean?
- Did the study include people with all the common presentations of the target disorder, from early, mild manifestations to more severe disease, and/or people with other disorders that are commonly confused with the target disorder during diagnosis? Why? Studies that only compare people with obvious symptoms against people with no symptoms are not very useful! If you can make the diagnosis by eye, why would you need a diagnostic test?
3. Was the reference standard applied regardless of the index diagnostic test result? What does that mean?
- If a patient has a negative index test result, investigators sometimes do not carry out the reference standard test to confirm it, especially when the reference test is invasive or risky, because doing so may be unethical. To overcome this, investigators can employ an alternative reference standard for showing that the patient does not have the target disorder: long-term follow-up to confirm that, without any treatment, no adverse effects associated with the target disorder appear. Why? To confirm the accuracy of the index test; in other words, that a negative index test result is in fact correct and the patient definitely does not have the disease.
4. Was the test validated in a second independent group of patients? What does that mean?
- When a new diagnostic test is evaluated, there is a risk that the results in the initial assessment are caused by other factors: for example, something about that specific group of patients included in the study (e.g., they represent only patients with advanced symptoms of the disease). So, to prove the results are reliable and replicable, the new diagnostic test should be evaluated in a second independent (or test) group of patients. Why? If the results in this second group of patients are similar to the results in the first group of patients, then we can be reassured about the test accuracy. If no test set study has been carried out, then maybe we need to reserve judgment.
In conclusion: If the study that we are evaluating fails any of these 4 criteria, we need to consider whether the flaws of the study make the results invalid.
How do we assess the results of the test?
There are two types of result commonly reported in diagnostic test studies. One concerns the accuracy of the test and is reflected in the sensitivity and specificity, often defined as the test's ability to find true positives for the disorder (sensitivity) or true negatives for the disorder (specificity). An ideal diagnostic test finds no false positives but at the same time misses no one with the disease (finds no false negatives) — much easier said than done!
The other concerns how the test performs in the population being tested and is reflected in predictive values (also called post-test probabilities) and likelihood ratios. To give brief definitions of these terms, consider this example (based on reference [5]):
1000 elderly people with suspected dementia undergo an index test and a reference standard. The prevalence of dementia in this group is 25%. 240 people tested positive on both the index test and the reference standard and 600 people tested negative on both tests. The remaining 160 people had inaccurate test results.
The first step is to draw a 2x2 table. We are told that the prevalence of dementia is 25%; therefore, we can fill in the bottom row of totals: 25% of 1000 people is 250, so 250 people will have dementia and 750 will be free of it. We also know the numbers of people testing positive and negative on both tests, so we can fill in two more cells. The remaining cells then follow by subtraction:

| | Dementia present | Dementia absent | Total |
| --- | --- | --- | --- |
| Index test positive | 240 (true positives) | 150 (false positives) | 390 |
| Index test negative | 10 (false negatives) | 600 (true negatives) | 610 |
| Total | 250 | 750 | 1000 |
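Completing the table is simple arithmetic; here is a minimal Python sketch of that step, using only the numbers stated in the worked example (1000 people, 25% prevalence, 240 positive on both tests, 600 negative on both):

```python
# Worked example: 1000 elderly people with suspected dementia
total = 1000
prevalence = 0.25
true_pos = 240   # positive on both the index test and the reference standard
true_neg = 600   # negative on both tests

with_disease = int(total * prevalence)    # 250 people have dementia
without_disease = total - with_disease    # 750 people do not

# The remaining cells of the 2x2 table follow by subtraction
false_neg = with_disease - true_pos       # 250 - 240 = 10
false_pos = without_disease - true_neg    # 750 - 600 = 150

test_positive = true_pos + false_pos      # 390 people test positive
test_negative = false_neg + true_neg      # 610 people test negative
print(false_pos, false_neg, test_positive, test_negative)  # 150 10 390 610
```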
Now we are ready to calculate the various measures.
| Term | Definition | Example |
| --- | --- | --- |
| Pre-test probability = (true positives + false negatives)/total number of people | The probability of having the target condition before the diagnostic test is run; in a study population this is the prevalence | In this example: 250/1000 = 0.25. What does that mean? A patient in this study has a 25% probability of having dementia before the tests are run |
| Sensitivity (Sn) = the proportion of people with the condition who have a positive test result | Tells us how well the test identifies people with the condition. A highly sensitive test will not miss many people | In our example, Sn = 240/250 = 0.96. What does that mean? 10 (4%) of the people with dementia were falsely identified as not having it, while 240 (96%) were correctly identified. The test is fairly good at identifying people with the condition |
| Specificity (Sp) = the proportion of people without the condition who have a negative test result | Tells us how well the test identifies people without the condition. A highly specific test will not falsely identify many people as having the condition | In our example, Sp = 600/750 = 0.80. What does that mean? 150 (20%) of the people without dementia were falsely identified as having it. The test is only moderately good at identifying people without the condition |
| Positive predictive value (PPV) = the proportion of people with a positive test who have the condition | Tells us how well the test performs in this population. It depends on the accuracy of the test (primarily specificity) and the prevalence of the condition | In our example, PPV = 240/390 = 0.62. What does that mean? Of the 390 people with a positive test result, 62% will actually have dementia |
| Negative predictive value (NPV) = the proportion of people with a negative test who do not have the condition | Tells us how well the test performs in this population. It depends on the accuracy of the test and the prevalence of the condition | In our example, NPV = 600/610 = 0.98. What does that mean? Of the 610 people with a negative test result, 98% will not have dementia |
| Likelihood ratio for positive results (LR+) = sensitivity/(1 - specificity) | Tells us how much a positive result raises the probability of the condition. A likelihood ratio > 1 indicates the result is associated with presence of the disease | In this example, LR+ = 0.96/0.20 = 4.8. What does that mean? A person with dementia is 4.8 times more likely to have a positive test result than a person without dementia |
| Likelihood ratio for negative results (LR-) = (1 - sensitivity)/specificity | Tells us how much a negative result lowers the probability of the condition. A likelihood ratio < 1 indicates the result is associated with absence of the disease | In this example, LR- = 0.04/0.80 = 0.05. What does that mean? A person with dementia is only 0.05 times as likely (20 times less likely) to have a negative test result as a person without dementia |
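All of these measures can be reproduced with a few lines of Python from the cell counts of the worked 2x2 table (240 true positives, 150 false positives, 10 false negatives, 600 true negatives):

```python
# Cell counts from the completed 2x2 table in the worked example
tp, fp, fn, tn = 240, 150, 10, 600

sensitivity = tp / (tp + fn)                # 240/250 = 0.96
specificity = tn / (tn + fp)                # 600/750 = 0.80
ppv = tp / (tp + fp)                        # 240/390 ~ 0.62
npv = tn / (tn + fn)                        # 600/610 ~ 0.98
lr_plus = sensitivity / (1 - specificity)   # 0.96/0.20 = 4.8
lr_minus = (1 - sensitivity) / specificity  # 0.04/0.80 = 0.05

print(round(sensitivity, 2), round(specificity, 2))  # 0.96 0.8
print(round(ppv, 2), round(npv, 2))                  # 0.62 0.98
print(round(lr_plus, 1), round(lr_minus, 2))         # 4.8 0.05
```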
How to apply the diagnostic test to a specific patient:
Having found a valid diagnostic test study, and decided that its accuracy is sufficiently high to make it a useful tool, here are some useful points to consider when applying the test to a specific patient:
- Is the test available, affordable, and accurate in our setting?
- Can a clinically sensible estimate of the pre-test probabilities of the patient be made from personal experience, prevalence statistics, practice databases, or primary studies?
- Are the study patients similar to the patient in question?
- How current is the study we are analyzing — has evidence moved on since the publication of the study?
- Will the post-test probability affect the management of the specific patient? Could the result move the clinician across a test-treatment threshold: for example, could the results of the test stop all further testing, either by ruling the target disorder out so the clinician stops pursuing that possibility, or by making a firm diagnosis of the target disorder so the clinician moves on to choosing appropriate treatment options?
- Will the patient be willing to have the test carried out?
- Will the results of the test help the patient reach their goals?
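Estimating a patient's post-test probability from a pre-test probability and a likelihood ratio is done via odds (this is the arithmetic behind a Fagan nomogram). A minimal sketch, using the dementia example's numbers to check the conversion:

```python
def post_test_probability(pre_test_prob: float, likelihood_ratio: float) -> float:
    """Combine a pre-test probability with a likelihood ratio via odds."""
    pre_odds = pre_test_prob / (1 - pre_test_prob)   # probability -> odds
    post_odds = pre_odds * likelihood_ratio          # apply the test result
    return post_odds / (1 + post_odds)               # odds -> probability

# Dementia example: pre-test probability 0.25, LR+ = 4.8, LR- = 0.05
print(round(post_test_probability(0.25, 4.8), 2))   # 0.62, matches the PPV
print(round(post_test_probability(0.25, 0.05), 3))  # 0.016, i.e. 1 - NPV
```

Note that the post-test probabilities recover the predictive values exactly, because predictive values are post-test probabilities for the study population's own prevalence; for a patient with a different pre-test probability, the same likelihood ratios give different post-test probabilities.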
Critical appraisal
Based on the information given in the Assessment section above, the table below gives some basic checkpoints to look for when critically appraising a diagnostic test study. The list is by no means comprehensive but should cover the main issues. Its focus is the first two questions: the validity of the study and the importance of its results.
References
- Bossuyt PM, Reitsma JB, Bruns DE, et al. Towards a complete and accurate reporting of studies of diagnostic accuracy: the STARD initiative. Clin Chem 2003;49:1–6. https://www.ncbi.nlm.nih.gov/pubmed/12507953
- CASP UK. Critical Appraisal Skills Programme (CASP) https://www.casp-uk.net (last accessed 9 March 2017)
- Sackett DL, Straus SE, Richardson WS, et al. Evidence-based medicine: how to practice and teach EBM. 2nd ed. Edinburgh: Churchill Livingstone, 2000.
- Lijmer JG, Mol BW, Heisterkamp S, et al. Empirical evidence of design-related bias in studies of diagnostic tests. JAMA 1999;282:1061–1066. https://www.ncbi.nlm.nih.gov/pubmed/10493205
- Centre for Evidence Based Medicine. https://www.cebm.net/likelihood-ratios/ (last accessed 9 March 2017).