Abstract
Introduction: Studying the effects of immigration on cancer patterns has become increasingly important for health disparities research in the U.S. Although data on place of birth are routinely collected in the participating Surveillance, Epidemiology, and End Results (SEER) registries, they are missing for a large proportion of cases. Furthermore, the distribution of missing nativity data is non-random, so the problem cannot be handled with simple strategies such as list-wise deletion. Here we present a multiple imputation (MI) strategy that uses variables in the SEER database to impute nativity status for Hispanic patients diagnosed with invasive cervical (CC), prostate (PC), or colorectal (CRC) cancer between 1988 and 2009. We focus on Hispanic patients because they represent the largest immigrant group in the U.S.
Methods: We used the SAS MI procedure to generate nativity values (U.S.-born vs. foreign-born) by the logistic regression imputation method. Among cases with known nativity status, a logistic regression model was fitted for nativity using covariates defined a priori as clinically relevant or significantly associated with nativity status. Covariates included age, stage at diagnosis, receipt of cancer-directed surgery and/or radiation, SEER site, and Hispanic origin. To impute missing nativity, a new regression model was simulated in 20 iterations using the posterior predictive distribution of the parameters, based on the fitted regression coefficients. The imputation strategy was validated in a random sample of 20% of the observed data with known nativity status.
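The imputation step described above can be illustrated with a minimal Python sketch. It is not the SAS MI procedure and not the authors' code: it assumes a single synthetic covariate plus an intercept (standing in for the SEER covariates), made-up data, and hypothetical function names (`fit_logistic`, `impute_nativity`). It fits a logistic model to the cases with known nativity, draws new coefficients from a normal approximation to their posterior, and then draws a 0/1 nativity value for each case with missing nativity, repeated 20 times.

```python
import math
import random

def expit(z):
    z = max(-30.0, min(30.0, z))  # clamp to avoid overflow in exp
    return 1.0 / (1.0 + math.exp(-z))

def invert(A):
    """Gauss-Jordan inverse of a small square matrix (partial pivoting)."""
    n = len(A)
    M = [row[:] + [float(i == j) for j in range(n)] for i, row in enumerate(A)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        d = M[c][c]
        M[c] = [v / d for v in M[c]]
        for r in range(n):
            if r != c:
                f = M[r][c]
                M[r] = [a - f * b for a, b in zip(M[r], M[c])]
    return [row[n:] for row in M]

def cholesky(A):
    """Lower-triangular L with L L^T = A (A symmetric positive definite)."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = A[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(s) if i == j else s / L[j][j]
    return L

def fit_logistic(X, y, iters=25):
    """Newton-Raphson fit; returns coefficients and inverse Fisher information."""
    p = len(X[0])
    beta = [0.0] * p
    for _ in range(iters):
        grad = [0.0] * p
        H = [[0.0] * p for _ in range(p)]
        for xi, yi in zip(X, y):
            mu = expit(sum(b * x for b, x in zip(beta, xi)))
            for j in range(p):
                grad[j] += (yi - mu) * xi[j]
                for k in range(p):
                    H[j][k] += mu * (1 - mu) * xi[j] * xi[k]
        cov = invert(H)
        beta = [b + sum(cov[j][k] * grad[k] for k in range(p))
                for j, b in enumerate(beta)]
    return beta, cov

def impute_nativity(X_obs, y_obs, X_mis, m=20, seed=0):
    """Return m imputed 0/1 vectors (1 = foreign-born) for the rows of X_mis."""
    rng = random.Random(seed)
    beta, cov = fit_logistic(X_obs, y_obs)
    L = cholesky(cov)
    p = len(beta)
    draws = []
    for _ in range(m):
        z = [rng.gauss(0, 1) for _ in range(p)]
        # New coefficients: beta + L z has covariance L L^T = cov
        b_star = [beta[j] + sum(L[j][k] * z[k] for k in range(j + 1))
                  for j in range(p)]
        draws.append([int(rng.random() < expit(sum(b * x for b, x in zip(b_star, xi))))
                      for xi in X_mis])
    return draws

# Synthetic example (hypothetical data, not SEER): intercept + one covariate
rng = random.Random(1)
X = [[1.0, rng.gauss(0, 1)] for _ in range(300)]
y = [int(rng.random() < expit(-0.5 + 1.5 * x[1])) for x in X]
X_obs, y_obs, X_mis = X[:200], y[:200], X[200:]
imputations = impute_nativity(X_obs, y_obs, X_mis)
```

Each of the 20 vectors in `imputations` would complete one copy of the dataset; analyses are then run on each completed copy and the estimates pooled, which is how MI carries the uncertainty about the imputed values into the final results.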
Results: Nativity was missing for 31%, 51%, and 37% of CC, PC, and CRC cases, respectively. The imputation strategy performed best for CC and PC, correctly classifying nativity for 93% of cases for these cancers. The sensitivity and specificity for detecting foreign-born status were high (0.95 and 0.90, respectively, for CC and 0.94 and 0.90, respectively, for PC). For both cancers, there was very good agreement between the true and imputed values (kappa = 0.83 and 0.85, respectively). Although sensitivity, specificity, and agreement were also high for CRC (0.87, 0.91, and 0.78, respectively), the imputation strategy misclassified nativity for 11% of cases.
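For readers less familiar with these validation measures, the following Python sketch shows how sensitivity, specificity, percent correctly classified, and Cohen's kappa follow from a 2x2 table of true versus imputed nativity. The function name `validation_metrics` and the counts are hypothetical and chosen only for illustration; they are not the study's data.

```python
def validation_metrics(tp, fn, fp, tn):
    """Metrics for detecting foreign-born status from a 2x2 table.

    tp: foreign-born, imputed foreign-born   fn: foreign-born, imputed U.S.-born
    fp: U.S.-born, imputed foreign-born      tn: U.S.-born, imputed U.S.-born
    """
    n = tp + fn + fp + tn
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / n  # proportion correctly classified
    # Agreement expected by chance, from the row and column margins
    expected = ((tp + fn) * (tp + fp) + (fp + tn) * (fn + tn)) / n ** 2
    kappa = (accuracy - expected) / (1 - expected)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": accuracy,
        "misclassified": 1 - accuracy,
        "kappa": kappa,
    }

# Hypothetical counts for illustration only
metrics = validation_metrics(tp=90, fn=10, fp=20, tn=80)
```

With these made-up counts the sketch gives a sensitivity of 0.90, a specificity of 0.80, 85% correctly classified, and a kappa of 0.70. Kappa is lower than the raw agreement because it discounts the agreement expected by chance.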
Conclusion: MI by logistic regression performed well for imputing nativity status for CC, PC, and CRC cases, with sensitivity ≥ 0.87 and specificity ≥ 0.90 for detecting foreign-born status; sensitivity was higher among CC and PC cases (≥ 0.94). The misclassification error was less than 10% for CC and PC and only slightly higher for CRC. Another proposed strategy, which imputes nativity based on the date of receipt of a social security number (SSN), has a sensitivity of 0.81 and a specificity of 0.80 for detecting foreign-born status among Asians with invasive breast cancer. Although we did not evaluate the same population, our data suggest that the proposed MI by logistic regression strategy may more accurately impute nativity status among the high proportion of SEER cases missing these data. Additionally, the strategy uses only variables available in the SEER database and is thus substantially less labor-intensive than the SSN method, as SSN is not reported in SEER. Use of the MI strategy will allow researchers to disaggregate analyses by nativity and uncover important nativity disparities in cancer diagnosis, treatment, and survival.
Citation Format: Jane R. Montealegre, Renke Zhou, E. Susan Amirian, Michele Follen, Michael E. Scheurer. Uncovering nativity disparities in cancer patterns: A multiple imputation strategy to handle missing nativity data in the SEER database. [abstract]. In: Proceedings of the Fifth AACR Conference on the Science of Cancer Health Disparities in Racial/Ethnic Minorities and the Medically Underserved; 2012 Oct 27-30; San Diego, CA. Philadelphia (PA): AACR; Cancer Epidemiol Biomarkers Prev 2012;21(10 Suppl):Abstract nr A09.