Implications of outcome misclassification in risk effect modeling in cancer population studies
Highlight box
Key findings
• Under non-differential misclassification, bias in the cancer risk model is worse at lower cancer prevalence (1.95%). At lower prevalence, bias is driven mainly by specificity, whereas at higher prevalence it is influenced by both specificity and sensitivity. With differential misclassification, bias can either underestimate or overestimate the effect of a binary covariate.
What is known and what is new?
• It is important to minimize misclassification rates when examining general disease outcomes.
• This study quantifies the potentially large effects of outcome misclassification on cancer risk modeling.
What is the implication, and what should change now?
• Validating binary cancer outcomes is necessary to minimize misclassification rates. Electronic health records may be beneficial in validating these outcomes.
Introduction
Accurate ascertainment of cancer outcomes is essential for developing reliable cancer risk models. To achieve this, it is important to minimize cancer outcome misclassification, which can occur in two ways: false negatives and false positives. The false negative rate [1 − sensitivity (Se)] is the probability that an individual with cancer is incorrectly classified as cancer-free. Similarly, the false positive rate [1 − specificity (Sp)] is the probability that a cancer-free individual is incorrectly classified as having cancer.
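To make these definitions concrete, Se, Sp, and the two error rates can be computed directly from a 2×2 confusion table. A minimal Python sketch (the function name and the illustrative counts are ours, not from the study):

```python
def classification_rates(tp, fn, fp, tn):
    """Sensitivity, specificity, and error rates from a 2x2 confusion table.

    tp: true cases classified as cases      fn: true cases missed
    fp: non-cases classified as cases       tn: non-cases correctly cleared
    """
    se = tp / (tp + fn)  # sensitivity: P(classified case | true case)
    sp = tn / (tn + fp)  # specificity: P(classified non-case | true non-case)
    return {
        "Se": se,
        "Sp": sp,
        "false_negative_rate": 1 - se,  # 1 - Se
        "false_positive_rate": 1 - sp,  # 1 - Sp
    }

# Hypothetical validation sample: 100 true cases, 1,000 true non-cases.
rates = classification_rates(tp=90, fn=10, fp=5, tn=995)
```

Here `rates["Se"]` is 0.90 and `rates["Sp"]` is 0.995, i.e., a 10% false negative rate and a 0.5% false positive rate.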
There are two common methods for cancer ascertainment: self-reporting through questionnaires and linkage to cancer registries. Self-reporting often leads to both false positives and false negatives due to inaccuracies in patient-reported data. In contrast, cancer registry linkage is generally considered to be a more accurate and cost-effective approach (1). However, this method is not without its limitations. For instance, missing personal identification information, such as a patient’s social security number, can result in a false positive linkage (2). In addition, cancer registries in the U.S. are state-specific with no routine data exchange mechanisms across states, and investigators may need to link their study population to multiple cancer registries. If a study participant moves outside the registry coverage area, any incident cancers afterwards may be missed, leading to false negative cases (3). To mitigate these issues, medical record validation can be employed to confirm the cancer cases identified by either ascertainment method, significantly reducing the false positive rate. However, since medical record validation is rarely performed for individuals identified as cancer-free, it does not effectively reduce the false negative rate.
When the Se and Sp depend on a covariate in the risk model, the misclassification is called “differential”; otherwise, it is called “non-differential”. A large body of literature addresses differential outcome misclassification (4,5) and non-differential outcome misclassification (6,7) from a general disease perspective. However, few have focused on the context of cancer outcomes, which has several unique features. First, most cancer outcomes are rare, and the impact of misclassifying rare outcomes may differ substantially from diseases with more frequent outcomes (8). Second, cancer outcome ascertainment methods typically have high Sp but relatively low Se. While it is uncommon for these methods to find non-existent cancer cases, it is common for them to miss actual cancer cases, for example due to incomplete registry coverage. Third, cancer outcome misclassifications are often associated with key demographic variables, such as race, ethnicity, age, and gender, which poses a threat to the validity of cancer disparity research. For example, Raza et al. showed that cancer diagnoses in women were under-reported more frequently in cancer registries compared to men, suggesting that the Se of cancer outcome assessment is related to gender (9). In another example, Randall et al. determined that linkage errors may be more common for younger study participants or those living in remote locations, and thus misclassification may be related to age or residential area (10). These studies suggest misclassification may be differential with respect to certain sociodemographic variables.
The purpose of this article is to highlight the effects of outcome misclassification on risk modeling for practicing cancer epidemiologists, and to recommend strategies to minimize the misclassification in data collection. Specifically, we aim to explore the ramifications of false-positive and false-negative rates in cancer ascertainment on relative-risk estimation. We focus on logistic regression modeling within both cohort and case-control study designs. Through extensive simulation studies and analytic calculations, we examine the potential differences in bias when misclassification is differential and non-differential.
Methods
We designed simulation frameworks to evaluate the performance of a risk model when the binary cancer outcome is misclassified. Two independent covariates were simulated: X1 from a Bernoulli distribution with probability 0.5 and X2 from a standard normal distribution. For example, X1 could represent a sociodemographic variable such as socioeconomic status, dichotomized into two categories: 1 denoting the most disadvantaged and 0 denoting the remaining population (10). Additionally, X2 could represent standardized age. The true outcome Y (e.g., cancer diagnosis within the first five years of the study) was generated from a logistic regression:
logit{Pr(Y = 1 | X1, X2)} = β0 + β1X1 + β2X2,

where β1 = β2 = 1 and β0 takes values from −5 to −3.5. The intercept parameter was varied to control the cancer prevalence between 1.95% and 7.5%. These prevalences represent typical cancer outcome frequencies in epidemiologic studies: the low prevalence of 1.95% mimics rarer cancers over 5–10 years of follow-up, while the high prevalence of 7.5% captures moderately common cancers over longer follow-up. This range allows us to examine how misclassification impacts bias across realistic scenarios. We considered simulations under two scenarios: one where the Se and Sp did not depend on the covariates (non-differential misclassification), and another where these parameters differed by the binary covariate X1 (differential misclassification).
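As a rough illustration of this setup, the following pure-Python sketch generates a cohort from the logistic model above. The paper's simulations were run in R; this translation, the function names, and the slope values b1 = b2 = 1 (consistent with the point estimates near 1 reported in Tables 3 and 4) are our assumptions:

```python
import math
import random

def simulate_cohort(n, b0, b1=1.0, b2=1.0, seed=2024):
    """Generate (x1, x2, y) rows from the logistic risk model in the Methods.

    Slopes b1 = b2 = 1 are assumed for illustration; the intercept b0
    controls the outcome prevalence.
    """
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x1 = int(rng.random() < 0.5)   # binary covariate, P(X1 = 1) = 0.5
        x2 = rng.gauss(0.0, 1.0)       # standard normal covariate
        p = 1.0 / (1.0 + math.exp(-(b0 + b1 * x1 + b2 * x2)))
        y = int(rng.random() < p)      # true cancer outcome
        rows.append((x1, x2, y))
    return rows

cohort = simulate_cohort(50_000, b0=-5.0)
prevalence = sum(y for _, _, y in cohort) / len(cohort)  # roughly 2%
```

With `b0 = -5.0` the realized prevalence comes out near the paper's low-prevalence setting, and moving the intercept to −3.5 pushes it toward the high-prevalence setting of about 7.5%.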
In the first scenario of non-differential misclassification, the observed outcome Y* was generated as a misclassified version of Y, with Se ranging from 80% to 100% and Sp ranging from 98% to 100%. For example, with a sample size of 50,000 and a prevalence of 1.95%, an 80% Se roughly corresponds to 200 false negatives, and a 98% Sp corresponds to about 1,000 false positives.
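The back-of-envelope counts above can be reproduced with a short sketch. Again this is an illustrative Python translation, not the authors' R code:

```python
import random

def misclassify(y, se, sp, rng):
    """Return the observed outcome Y* given the true outcome y."""
    if y == 1:
        return int(rng.random() < se)   # true case kept with probability se
    return int(rng.random() >= sp)      # false positive with probability 1 - sp

# Apply 80% Se / 98% Sp to a synthetic outcome vector with ~2% prevalence.
rng = random.Random(3)
truth = [1] * 1000 + [0] * 49_000
observed = [misclassify(y, se=0.80, sp=0.98, rng=rng) for y in truth]
false_negatives = sum(1 for y, s in zip(truth, observed) if y == 1 and s == 0)
false_positives = sum(1 for y, s in zip(truth, observed) if y == 0 and s == 1)
```

On average this yields about 200 false negatives (20% of the 1,000 true cases) and about 980 false positives (2% of the 49,000 true non-cases), matching the counts quoted in the text.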
In the second scenario of differential misclassification, we considered four cases where Se/Sp depend on the binary covariate X1:
- Se is 75.7% for the subjects with X1=0 and 90% with X1=1. Sp is 100% for all the subjects.
- Se is 90% for the subjects with X1=0 and 75.7% with X1=1. Sp is 100% for all the subjects.
- Sp is 99% for the subjects with X1=0 and 100% with X1=1. Se is 100% for all the subjects.
- Sp is 100% for the subjects with X1=0 and 99% with X1=1. Se is 100% for all the subjects.
The misclassification parameters were chosen to facilitate comparisons between the differential and non-differential cases. For example, the parameters in the first two cases were chosen so that the overall Se for the differential case was 80%, aligning with a setting presented for the non-differential scenario. Similarly, the parameters in the last two cases correspond to the non-differential scenario of a Sp equal to 99.5% and a Se equal to 100%.
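Implementing differential misclassification only requires letting Se and Sp depend on X1. A hypothetical sketch of the first case in the list above (Python, with names of our choosing):

```python
import random

def misclassify_differential(y, x1, rng, se_by_x1, sp_by_x1):
    """Misclassify y with Se/Sp that depend on the binary covariate x1."""
    se, sp = se_by_x1[x1], sp_by_x1[x1]
    if y == 1:
        return int(rng.random() < se)   # kept with group-specific sensitivity
    return int(rng.random() >= sp)      # flipped with group-specific 1 - sp

# First differential case: Se 75.7% when X1 = 0, 90% when X1 = 1,
# perfect Sp for all subjects.
rng = random.Random(7)
se_by_x1 = {0: 0.757, 1: 0.90}
sp_by_x1 = {0: 1.00, 1: 1.00}
ystar = misclassify_differential(1, 0, rng, se_by_x1, sp_by_x1)
```

The other three cases correspond to swapping the dictionary entries, or to holding Se at 100% while letting Sp differ by X1.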
Our simulations focus on a prospective cohort design, where a binary outcome is generated for every participant. Given that cancer outcomes are often rare, a case-control study is a practical design for studying etiological effects on cancer risk. Findings for the case-control design are presented in the supplementary materials (see Tables S1-S6 for misclassification results and Figures S7-S10 for coverage rates).
Each simulation generated 50,000 individuals, representing a cohort study dataset. The large sample size was used to illustrate bias patterns driven by Se and Sp rather than by finite sample bias (i.e., logistic regression risk estimates are biased for rare outcomes at low prevalence). The case-control design then drew a sample of 1,000 individuals based on Y*, including 500 individuals each with Y*=0 and Y*=1. Logistic regression models were then fitted to the full cohort of 50,000 individuals and the case-control samples of 1,000 individuals, under different settings of disease prevalence and the Se and Sp of Y*. This process was repeated 500 times, and we reported the bias, the standard deviation of the point estimates, the mean of the standard errors, and confidence interval (CI) coverage rates for the estimated parameters (β1, β2). The coverage rates were calculated as the proportion of simulations in which a 95% CI covers the true parameter value; with valid interval inference, the CI coverage rate should be close to 95%. All simulations were conducted using R version 4.3.1.
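One replicate of this pipeline can be sketched end to end in pure Python: simulate a cohort, misclassify the outcome, and refit the risk model by Newton-Raphson. This is a condensed illustration under our assumptions, not the authors' R code; the sample size is reduced to 10,000 for speed, the slopes are assumed to be 1, and only a single replicate (rather than 500) is shown:

```python
import math
import random

def solve3(A, b):
    """Solve a 3x3 linear system by Gaussian elimination with partial pivoting."""
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for c in range(3):
        p = max(range(c, 3), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, 3):
            f = M[r][c] / M[c][c]
            for k in range(c, 4):
                M[r][k] -= f * M[c][k]
    x = [0.0, 0.0, 0.0]
    for r in (2, 1, 0):
        x[r] = (M[r][3] - sum(M[r][k] * x[k] for k in range(r + 1, 3))) / M[r][r]
    return x

def fit_logistic(rows, iters=8):
    """Newton-Raphson fit of logit P(Y=1) = b0 + b1*x1 + b2*x2."""
    beta = [0.0, 0.0, 0.0]
    for _ in range(iters):
        grad = [0.0] * 3
        hess = [[0.0] * 3 for _ in range(3)]
        for x1, x2, y in rows:
            x = (1.0, float(x1), x2)
            eta = max(-30.0, min(30.0, sum(b * v for b, v in zip(beta, x))))
            p = 1.0 / (1.0 + math.exp(-eta))
            w = p * (1.0 - p)
            for i in range(3):
                grad[i] += (y - p) * x[i]
                for j in range(3):
                    hess[i][j] += w * x[i] * x[j]
        beta = [b + s for b, s in zip(beta, solve3(hess, grad))]
    return beta

def one_replicate(n, b0, se, sp, seed):
    """Simulate a cohort, misclassify the outcome, and refit the risk model."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x1 = int(rng.random() < 0.5)
        x2 = rng.gauss(0.0, 1.0)
        p = 1.0 / (1.0 + math.exp(-(b0 + x1 + x2)))  # true slopes set to 1
        y = int(rng.random() < p)
        # observed outcome Y*: false negative with prob 1-se, false positive with prob 1-sp
        ystar = int(rng.random() < se) if y == 1 else int(rng.random() >= sp)
        rows.append((x1, x2, ystar))
    return fit_logistic(rows)

# One replicate at the ~7.5% prevalence setting, with and without misclassification.
b_clean = one_replicate(10_000, b0=-3.5, se=1.0, sp=1.0, seed=11)
b_mis = one_replicate(10_000, b0=-3.5, se=1.0, sp=0.98, seed=11)
# b_mis[1] (the X1 slope) is attenuated toward zero relative to b_clean[1].
```

Repeating `one_replicate` over many seeds and averaging the estimates reproduces the bias columns of Tables 1 and 2; the paper additionally records Monte Carlo standard deviations, mean standard errors, and CI coverage.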
In the supplementary material, we computed theoretical calculations of bias under various settings of non-differential and differential outcome misclassifications. These calculations were simplified by considering only one binary covariate (as opposed to the two covariates used in the simulation settings), allowing us to examine the effects of outcome misclassification in 2×2 tables (see Tables S7,S8). Unlike the simulation results, these calculations enable the exploration of a wider range of Se and Sp grid points, providing a more comprehensive assessment of the bias patterns. These results are shown in contour plots presented in the supplementary materials (see Figures S11-S13). Nonetheless, we emphasize the importance of the simulation studies, as they examine estimation performance in practical settings.
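The single-covariate calculation works as follows: in each stratum of the binary covariate, the observed case probability is q = Se·p + (1 − Sp)·(1 − p), and the observed log odds ratio is then computed from q0 and q1. A sketch with illustrative stratum probabilities (our choice, giving a true log odds ratio of exactly 1 at roughly 2% prevalence):

```python
import math

def observed_log_or(p0, p1, se, sp):
    """Log odds ratio computed from misclassified outcome probabilities.

    p0, p1: true P(Y=1) in the X=0 and X=1 strata. The observed stratum
    probabilities mix retained true positives and false positives.
    """
    q0 = se * p0 + (1.0 - sp) * (1.0 - p0)
    q1 = se * p1 + (1.0 - sp) * (1.0 - p1)
    return math.log(q1 / (1.0 - q1)) - math.log(q0 / (1.0 - q0))

# Illustrative strata: logit(p0) = -4.3, logit(p1) = -3.3, so the true
# log odds ratio is exactly 1 and overall prevalence is near 2%.
p0 = 1.0 / (1.0 + math.exp(4.3))
p1 = 1.0 / (1.0 + math.exp(3.3))
attenuated = observed_log_or(p0, p1, se=1.0, sp=0.995)  # well below 1
```

With perfect Se and Sp the function returns the true log odds ratio of 1; dropping Sp to 99.5% attenuates it markedly (to about 0.81 here), while dropping Se to 80% with perfect Sp barely moves it, mirroring the patterns in the contour plots.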
Results
Non-differential misclassification
Table 1 shows the bias of estimating β1 using the full cohort data at 1.95% disease prevalence, with Sp ranging from 98% to 100% and Se from 80% to 100%. The reported negative biases indicate underestimation of the associations between the risk factors and the cancer outcome. We observe that bias is highly influenced by Sp. For example, with a very high Sp of 99.5% and perfect Se, the bias in estimating β1 was 22.9%, with a poor CI coverage rate of 4%. The bias increased substantially to 53.7% when Sp dropped to 98%. In contrast, Se has a much smaller impact on bias: with a Se of 80% and perfect Sp, the bias in estimating β1 was less than 1%. For cancer outcome ascertainment with a prevalence around 2%, these findings suggest that controlling the false positives (i.e., achieving near-perfect Sp) is much more important than controlling the false negatives (1 − Se).
Table 1
| Specificity (%) | Sensitivity (%) | Bias | Standard deviation | Mean of standard error | Coverage (%) |
|---|---|---|---|---|---|
| 100 | 100 | 0.010 | 0.071 | 0.073 | 94.8 |
| 100 | 90 | 0.000 | 0.076 | 0.077 | 96.2 |
| 100 | 80 | −0.006 | 0.080 | 0.082 | 95.6 |
| 99.9 | 100 | −0.051 | 0.068 | 0.071 | 88.8 |
| 99.9 | 90 | −0.059 | 0.072 | 0.074 | 87.6 |
| 99.9 | 80 | −0.079 | 0.076 | 0.078 | 84.6 |
| 99.5 | 100 | −0.229 | 0.063 | 0.063 | 4.0 |
| 99.5 | 90 | −0.249 | 0.063 | 0.065 | 2.8 |
| 99.5 | 80 | −0.273 | 0.071 | 0.068 | 2.2 |
| 99 | 100 | −0.370 | 0.055 | 0.056 | 0.0 |
| 99 | 90 | −0.392 | 0.059 | 0.058 | 0.0 |
| 99 | 80 | −0.424 | 0.058 | 0.059 | 0.0 |
| 98.5 | 100 | −0.473 | 0.049 | 0.051 | 0.0 |
| 98.5 | 90 | −0.498 | 0.053 | 0.052 | 0.0 |
| 98.5 | 80 | −0.521 | 0.051 | 0.054 | 0.0 |
| 98 | 100 | −0.537 | 0.049 | 0.048 | 0.0 |
| 98 | 90 | −0.563 | 0.050 | 0.049 | 0.0 |
| 98 | 80 | −0.591 | 0.050 | 0.050 | 0.0 |
Table 2 examines the same scenarios as Table 1, but with a higher prevalence of 7.5% instead of 1.95%. Notably, at low prevalence, bias is primarily driven by imperfect Sp, whereas at higher prevalence, Se starts to play a more important role. For example, in the scenario with Sp of 99.5% and Se of 100%, the bias was 22.9% under low prevalence but decreased to 6.9% under high prevalence, suggesting that imperfect Sp has a greater impact on bias when the disease prevalence is low. Conversely, for the scenario with Se of 80% and Sp of 100%, the bias was 0.6% for low prevalence and 3.0% for high prevalence. While Se is not as strong a driver of bias as Sp, its impact grows with increasing prevalence. The results for estimating β2 (corresponding to the continuous covariate) are presented in Tables S9,S10, showing similar patterns of bias.
Table 2
| Specificity (%) | Sensitivity (%) | Bias | Standard deviation | Mean of standard error | Coverage (%) |
|---|---|---|---|---|---|
| 100 | 100 | 0.000 | 0.037 | 0.039 | 96.0 |
| 100 | 90 | −0.015 | 0.039 | 0.040 | 93.2 |
| 100 | 80 | −0.030 | 0.042 | 0.042 | 88.8 |
| 99.9 | 100 | −0.013 | 0.038 | 0.038 | 93.4 |
| 99.9 | 90 | −0.032 | 0.040 | 0.040 | 87.6 |
| 99.9 | 80 | −0.048 | 0.043 | 0.042 | 76.8 |
| 99.5 | 100 | −0.069 | 0.037 | 0.037 | 55.0 |
| 99.5 | 90 | −0.095 | 0.040 | 0.038 | 31.6 |
| 99.5 | 80 | −0.116 | 0.038 | 0.040 | 16.2 |
| 99 | 100 | −0.133 | 0.036 | 0.035 | 3.8 |
| 99 | 90 | −0.159 | 0.038 | 0.037 | 1.8 |
| 99 | 80 | −0.187 | 0.039 | 0.038 | 0.2 |
| 98.5 | 100 | −0.190 | 0.035 | 0.034 | 0.0 |
| 98.5 | 90 | −0.214 | 0.036 | 0.035 | 0.0 |
| 98.5 | 80 | −0.243 | 0.036 | 0.037 | 0.0 |
| 98 | 100 | −0.233 | 0.033 | 0.033 | 0.0 |
| 98 | 90 | −0.262 | 0.033 | 0.034 | 0.0 |
| 98 | 80 | −0.298 | 0.037 | 0.036 | 0.0 |
In Tables 1,2, the CI coverage rates of β1 are poor once the bias exceeds 5% (see Tables S9,S10 for the CI coverage rates of β2). The average of the estimated standard errors is close to the Monte Carlo standard deviations, suggesting that variance estimation remains accurate under outcome misclassification. However, these coverage results pertain to the sample size considered (50,000). With such a large sample, the coverage rate depends on both bias and sample size and is heavily influenced by the bias. For cohorts with smaller sample sizes, we expect the coverage rates to be less influenced by bias and closer to the nominal level. Figure S5 shows this by comparing the coverage rates of β1 for cohorts with 5,000 and 50,000 individuals (see Figure S6 for β2).
Differential misclassification
Table 3 shows the simulation results with differential Se (equal to 90% or 75.7% depending on the binary covariate X1, with an overall Se of 80%). We also show results for the analogous non-differential case with the same overall Se of 80%. Unlike the case of non-differential Se, where estimation is only slightly attenuated, differential Se results in substantial bias in estimating β1 that can be in either direction. Conceptually, if X1 represents socioeconomic status (1 denoting the most disadvantaged and 0 denoting the remaining population) and the Se is 75.7% for individuals with the most disadvantaged socioeconomic status and 90% for the remaining population, then this misclassification results in a 22.4% underestimation of the effect of X1. Conversely, when the sensitivities for the two groups are reversed, there is a 16.9% overestimation.
Table 3
| | Non-differential: Se0 = Se1 = 0.8 | | Differential: Se0 = 0.9; Se1 = 0.757 | | Differential: Se0 = 0.757; Se1 = 0.9 | |
|---|---|---|---|---|---|---|
| | β1 | β2 | β1 | β2 | β1 | β2 |
| Point estimate | 0.970 | 0.965 | 0.777 | 0.963 | 1.169 | 0.977 |
| Bias | −0.030 | −0.035 | −0.224 | −0.037 | 0.169 | −0.023 |
| Coverage (%) | 88.80 | 63.20 | 0.00 | 55.60 | 2.00 | 78.40 |
| Standard deviation | 0.042 | 0.021 | 0.042 | 0.021 | 0.042 | 0.021 |
| Mean of standard errors | 0.042 | 0.021 | 0.041 | 0.021 | 0.042 | 0.021 |
Similarly, with differential Sp, the bias can also be in either direction (Table 4), in contrast to bias towards the null with non-differential Sp. It is also worth noting that, as Se and Sp are differential with respect to X1, but not X2, the bias in estimating β2 is comparable between differential and non-differential misclassification (Tables 3,4).
Table 4
| | Non-differential: Sp0 = Sp1 = 0.995 | | Differential: Sp0 = 1.00; Sp1 = 0.99 | | Differential: Sp0 = 0.99; Sp1 = 1.00 | |
|---|---|---|---|---|---|---|
| | β1 | β2 | β1 | β2 | β1 | β2 |
| Point estimate | 0.928 | 0.933 | 1.090 | 0.937 | 0.768 | 0.925 |
| Bias | −0.072 | −0.067 | 0.090 | −0.063 | −0.233 | −0.075 |
| Coverage (%) | 50.40 | 4.40 | 30.00 | 14.00 | 0.00 | 3.20 |
| Standard deviation | 0.036 | 0.018 | 0.038 | 0.019 | 0.033 | 0.020 |
| Mean of standard errors | 0.037 | 0.019 | 0.038 | 0.019 | 0.036 | 0.019 |
Discussion
In this paper, we examined the impact of cancer outcome misclassification through extensive simulations and theoretical bias calculations. We considered both cohort and case-control designs, but focused primarily on cohort studies as the results were similar between the two designs.
When misclassification is non-differential, we found that imperfect Sp introduces a substantial underestimation of relative risks, particularly at low prevalence. If cancer outcome ascertainment is based on self-report, lower Sp often results from misreporting cancer types (e.g., reporting a carcinoma in situ as cancer). While registry linkage is generally more accurate than self-report, false positives are still possible. For example, with incomplete social security information, an individual with a common name may be mistakenly identified as having cancer due to a false match in the registry linkage (2).
We also found that non-differential Se resulted in only a small bias in estimating relative risks. In practice, lower Se is quite common in cancer outcome ascertainment. False negatives in self-reported cancer outcomes can result from an individual failing to recall a cancer diagnosis (3). For cancer ascertainment through registries, false negatives can occur in several ways, including non-coverage in certain regions, under-reporting of cancers to registries, or individuals moving out of registry coverage areas (3). Our findings suggest that validating the non-cases is unnecessary when the Se is non-differential.
In contrast, when the misclassification is differential, the bias patterns are strikingly different from the non-differential case: bias can be substantial and occur in either direction. Unlike non-differential misclassification, where false negatives do not result in substantial bias, relative risk estimation can be seriously biased when the Se varies with a covariate in the model. For example, the effects of race-ethnicity on cancer risk may be poorly estimated in logistic regression when the Se or Sp of cancer ascertainment also varies by race-ethnicity. This finding emphasizes the value of understanding differences in the cancer ascertainment process by race-ethnicity in health disparities research, and recognizing situations where misclassification is linked to a key exposure.
Linet et al. [2020] provided a comprehensive review of 26 radiation cohort studies highlighting challenges in cancer outcome ascertainment and implications for risk estimation (11). They noted that when misclassification is non-differential to the key exposure (radiation dose levels), relative risk estimates tend to be biased towards the null, consistent with our simulation findings. They also identified four studies where the outcome misclassification appeared to vary by radiation level, though the impact of such misclassification was not quantified. In contrast, our study performed extensive simulation studies to quantitatively evaluate the bias in risk estimation under both differential and non-differential outcome misclassification. More recently, Liu et al. [2025] directly compared cancer risk estimates based on self-reported diagnoses versus registry linkage in a large U.S. cohort, demonstrating that risk estimates varied depending on the outcome ascertainment method, with self-report introducing attenuation likely due to lower Sp (12). These empirical findings align with our simulation results that reductions in Sp can produce substantial underestimation of relative risk, particularly when disease prevalence is low. In addition, recent work in the context of electronic health records (EHRs) has shown similar patterns. For example, Zhang et al. [2024] demonstrated through simulations that misclassification of EHR-derived cancer outcomes can introduce meaningful bias in effect estimates (13). Together, these studies reinforce our key conclusion: the accuracy of cancer outcome ascertainment, particularly Sp, plays a critical role in the validity of cancer risk models.
While our simulations cover a wide range of realistic scenarios, they cannot exhaust all possible combinations of prevalence, sample sizes, effect sizes, and misclassification mechanisms. We provided the simulation code in the Supplementary Materials and recommend that practitioners tailor the simulations to their own context to assess how outcome misclassification may influence their relative risk estimates.
We acknowledge that our study has several limitations. First, we focus on binary cancer outcomes, not time to cancer diagnosis. Time-to-event outcomes introduce additional complications, as the timing of the event and censoring may be subject to reporting errors, potentially leading to bias in hazard ratio estimates. While our conclusions likely extend to time-to-event analysis in terms of the direction of bias, the magnitude and pattern may differ, warranting further investigation. Second, our simulations assume that Se/Sp are known. In practice, when validation data are limited or unavailable, reliable estimation of Se/Sp may not be feasible. In such cases, we recommend that researchers conduct sensitivity analyses or simulation studies across plausible ranges of Se/Sp to assess the robustness of their findings.
Our study highlights the importance of validating cancer ascertainment methods against a gold standard (e.g., medical records). For non-differential misclassification, this is an easier problem, since we only need to validate those identified as cancer cases to mitigate false positives. Validation becomes more challenging for differential outcome misclassification because both false positives and false negatives can introduce severe bias. The difficulty in this case is that correcting false negatives requires validating a large fraction of the study population. We recommend that several sources of cancer outcome data be obtained and used to conduct sensitivity analyses for risk model estimation. For example, if self-report or cancer registry linkage is used as the primary outcome assessment method, EHRs may be a good source for identifying missed cases while also confirming identified cases (14).
Conclusions
Our study quantified the potentially large effects of a differentially or non-differentially misclassified binary cancer outcome in a cancer risk model. Our findings highlight the importance of accurate cancer outcome ascertainment. We found that bias may be in either direction under differential misclassification, whereas under non-differential misclassification the bias attenuated the cancer risk estimates toward the null. Researchers may need to collect several data sources to validate cancer outcomes, and to recognize situations where outcome misclassification is linked to a key covariate.
Acknowledgments
None.
Footnote
Peer Review File: Available at https://ace.amegroups.com/article/view/10.21037/ace-2025-3/prf
Funding: None.
Conflicts of Interest: All authors have completed the ICMJE uniform disclosure form (available at https://ace.amegroups.com/article/view/10.21037/ace-2025-3/coif). D.L. received support from National Cancer Institute (the author is an employee of this institute). The other authors have no conflicts of interest to declare.
Ethical Statement: The authors are accountable for all aspects of the work in ensuring that questions related to the accuracy or integrity of any part of the work are appropriately investigated and resolved.
Open Access Statement: This is an Open Access article distributed in accordance with the Creative Commons Attribution-NonCommercial-NoDerivs 4.0 International License (CC BY-NC-ND 4.0), which permits the non-commercial replication and distribution of the article with the strict proviso that no changes or edits are made and the original work is properly cited (including links to both the formal publication through the relevant DOI and the license). See: https://creativecommons.org/licenses/by-nc-nd/4.0/.
References
- Pinsky PF, Yu K, Black A, et al. Active follow-up versus passive linkage with cancer registries for case ascertainment in a cohort. Cancer Epidemiol 2016;45:26-31. [Crossref] [PubMed]
- Bond B, Brown JD, Luque A, et al. The Nature of the Bias When Studying Only Linkable Person Records: Evidence from the American Community Survey. United States Census Bureau; 2014.
- Liu D, Linet MS, Albert PS, et al. Ascertainment of Incident Cancer by US Population-Based Cancer Registries Versus Self-Reports and Death Certificates in a Nationwide Cohort Study, the US Radiologic Technologists Study. Am J Epidemiol 2022;191:2075-83. [Crossref] [PubMed]
- Chen Q, Galfalvy H, Duan N. Effects of disease misclassification on exposure-disease association. Am J Public Health 2013;103:e67-73. [Crossref] [PubMed]
- Chyou PH. Patterns of bias due to differential misclassification by case-control status in a case-control study. Eur J Epidemiol 2007;22:7-17. [Crossref] [PubMed]
- Dosemeci M, Wacholder S, Lubin JH. Does nondifferential misclassification of exposure always bias a true effect toward the null value? Am J Epidemiol 1990;132:746-8. [Crossref] [PubMed]
- Weinberg CR, Umbach DM, Greenland S. When will nondifferential misclassification of an exposure preserve the direction of a trend? Am J Epidemiol 1994;140:565-71. [Crossref] [PubMed]
- Mullins MA, Kler JS, Eastman MR, et al. Validation of Self-reported Cancer Diagnoses Using Medicare Diagnostic Claims in the US Health and Retirement Study, 2000-2016. Cancer Epidemiol Biomarkers Prev 2022;31:287-92. [Crossref] [PubMed]
- Raza SA, Jawed I, Zoorob RJ, et al. Completeness of Cancer Case Ascertainment in International Cancer Registries: Exploring the Issue of Gender Disparities. Front Oncol 2020;10:1148. [Crossref] [PubMed]
- Randall S, Brown A, Boyd J, et al. Sociodemographic differences in linkage error: an examination of four large-scale datasets. BMC Health Serv Res 2018;18:678. [Crossref] [PubMed]
- Linet MS, Schubauer-Berigan MK, Berrington de González A. Outcome Assessment in Epidemiological Studies of Low-Dose Radiation Exposure and Cancer Risks: Sources, Level of Ascertainment, and Misclassification. J Natl Cancer Inst Monogr 2020;2020:154-75. [Crossref] [PubMed]
- Liu D, Linet MS, Albert PS, et al. Examining bias due to method of follow-up for cancer incidence in a large U.S. cohort: Self-report versus registry linkage. Ann Epidemiol 2025;107:44-50.
- Zhang H, Clark AS, Hubbard RA. A Quantitative Bias Analysis Approach to Informative Presence Bias in Electronic Health Records. Epidemiology 2024;35:349-58. [Crossref] [PubMed]
- Leggat-Barr K, Ryu R, Hogarth M, et al. Early Ascertainment of Breast Cancer Diagnoses Comparing Self-Reported Questionnaires and Electronic Health Record Data Warehouse: The WISDOM Study. JCO Clin Cancer Inform 2023;7:e2300019. [Crossref] [PubMed]
Cite this article as: Hill LA, Albert PS, Figueroa JD, Liu D. Implications of outcome misclassification in risk effect modeling in cancer population studies. Ann Cancer Epidemiol 2026;10:3.

