Abstract
Work is needed to better understand how joint exposure to environmental and economic factors influences cancer. We hypothesize that environmental exposures vary with socioeconomic status (SES) and urban/rural location, and that areas with minority populations coincide with high economic disadvantage and pollution.
To model joint exposure to pollution and SES, we develop a latent class mixture model (LCMM) with three latent variables (SES Advantage, SES Disadvantage, and Air Pollution) and compare the LCMM fit with K-means clustering. We ran an ANOVA to test for high exposure levels in non-Hispanic black populations. The analysis is at the census tract level for the state of North Carolina.
The LCMM was a better and more nuanced fit to the data than K-means clustering. Our LCMM had two sublevels (low, high) within each latent class. The worst levels of exposure (high SES disadvantage, low SES advantage, high pollution) are found in 22% of census tracts, while the best levels (low SES disadvantage, high SES advantage, low pollution) are found in 5.7%. Overall, 34.1% of the census tracts exhibit high disadvantage, 66.3% have low advantage, and 59.2% have high mixtures of toxic pollutants. Areas with higher SES disadvantage had significantly higher non-Hispanic black population density (NHBPD; P < 0.001), and NHBPD was higher in areas with higher pollution (P < 0.001).
Joint exposure to air toxins and SES varies with rural/urban location and coincides with minority populations.
Our model can be extended to provide a holistic modeling framework for estimating disparities in cancer survival.
See all articles in this CEBP Focus section, “Environmental Carcinogenesis: Pathways to Prevention.”
Introduction
There is growing awareness that people are exposed to numerous environmental and social factors and that all of these may interact to influence an individual's susceptibility to disease or poor health. Our objective is to quantify the relationship between socioeconomic status (SES) and air quality and develop a framework for understanding the interplay of these factors and their relationship to disparities in cancer outcomes. Both factors are known to be associated with human health outcomes. The health disparities associated with low SES are wide ranging and include obesity (1), type 2 diabetes (2), and cancer (3–9). Negative health outcomes linked to air pollution include respiratory (10, 11) and cardiovascular (12–14) disease, increased rates of mortality (15, 16), and various cancers. Fine particulate matter (PM), volatile organic compounds (VOC), and other traffic-related air toxins have been linked to lung cancer (17–19); heavy metals (HM) have been implicated in tumor formation (20–22).
Current studies investigate the interplay of SES, air quality, and health. Coker and colleagues illustrate how pollution, built environment, and SES influence adverse birth outcomes in Los Angeles using Bayesian profile regression (23). Chi and colleagues use Cox proportional hazard models to study SES and the association between air pollution and cardiovascular disease (24). Vieira and colleagues also use Cox proportional hazard models to show the impact of SES and air pollution on geographic disparities in ovarian cancer survival in California (25). Weaver and colleagues illustrate the impact of fine PM and SES on cardiovascular health and diabetes by using Ward hierarchical clustering to identify clusters of SES factors and showing that areas of relative SES disadvantage have stronger associations between air pollution and negative health outcomes as compared with areas of SES advantage (26).
In addition to studying air pollution in conjunction with SES when examining health outcomes, there has also been increased interest in the impact of multipollution exposure (MPE) on poor health. Recent studies highlight that measuring a single pollutant may not adequately capture the total environmental burden individuals face in a community (27–32), and methodologies have emerged to model exposure to multiple air toxins. For example, Keller and colleagues use a K-means clustering method to establish MPE for cohorts (33). Lenters and colleagues study the effects of MPE on birth weight using elastic nets (34). Zanobetti and colleagues present a clustering method to show how multipollutant mixtures impact mortality (35). Pirani and colleagues use a Bayesian approach to model MPE (36).
We present a latent class mixture model (LCMM) to quantify the relationship between SES and MPE. LCMMs are used to understand the latent grouping mechanisms behind correlated variables, such as pollutants and SES variables. Once latent classes or groupings are accounted for, the variables are assumed to be conditionally independent (37). The advantage of LCMMs is that they are easily visualized to demonstrate interrelationships of measures, and can generate flexible summaries of many variables. For example, using LCMMs allows us to take a multitude of pollutants and derive discrete classifications that describe the joint impact of those pollutants. LCMMs have been used previously in epidemiologic studies (38), including those that model the impacts of MPE on human health (39) as well as SES on health (40). To our knowledge, this is the first application of LCMMs to study the interaction of MPE and SES together.
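The conditional independence assumption can be written compactly. The following is a sketch in generic notation; the symbols are illustrative assumptions and are not taken from the authors' model specification. For a census tract with observed variables y_1, ..., y_J and a single latent variable c with K classes, class weights π_k, and class-specific densities f_j:

```latex
P(y_1, \dots, y_J) \;=\; \sum_{k=1}^{K} \pi_k \prod_{j=1}^{J} f_j\!\left(y_j \mid c = k\right),
\qquad \sum_{k=1}^{K} \pi_k = 1,
\qquad
P(c = k \mid y_1, \dots, y_J) \;=\;
\frac{\pi_k \prod_{j=1}^{J} f_j\!\left(y_j \mid c = k\right)}
     {\sum_{l=1}^{K} \pi_l \prod_{j=1}^{J} f_j\!\left(y_j \mid c = l\right)}
```

The product over j is where conditional independence enters: within a class, the observed variables carry no further information about one another. The second expression is the posterior class probability that underlies class assignment.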
We develop an LCMM to quantify joint levels of exposure to pollution mixtures and SES at the census tract level in North Carolina (NC). We use this model to examine racial disparities in exposure and differences in urban versus rural communities. To highlight the strengths of using LCMM over other clustering methods, we compare our model with K-means clustering. While K-means clustering is a simple and effective algorithm for clustering correlated observations into groups, LCMM provides the option to incorporate expert knowledge about the relationships between the observed and latent variables into the model. Furthermore, we present a framework that builds upon this research to delineate disparities in cancer outcomes. We use publicly available data on pollutants and indicators of SES. We focus on air toxins such as diesel PM, VOCs, and HMs. We collect social and economic variables known to be indicative of either SES advantage or disadvantage (40, 41). We chose NC as the study domain because of its diversity in SES, presence of both rural and urban counties, and heterogeneity in air pollution. We identify vulnerable populations in NC at the census tract level and gain a better understanding of the interplay between SES and MPE.
Materials and Methods
We acquired data on air pollutants from the U.S. Environmental Protection Agency's (EPA) National Air Toxics Assessment (NATA; ref. 42). The NATA has been conducted approximately every 3 years since 1996. For this analysis, we used the most recent data from 2014. In the NATA, the EPA estimates ambient concentrations of air toxins via air quality models that use emissions data from the National Emissions Inventory along with meteorologic and other data as inputs (43). The result is spatially resolved numerically modeled estimates of multiple pollutants.
The NATA provides estimates of hazardous air pollutants (HAP) and diesel PM. HAPs, a class of pollutants that are considered carcinogenic and associated with negative health outcomes, are regulated under the 1990 Clean Air Act (42). Diesel PM is included in the NATA as an air toxic, but not considered a HAP under the 1990 Clean Air Act (43). However, some scientific evidence suggests a possible link between exposure to pollution from heavy traffic and breast cancer, so we include diesel PM in our model (44). In addition, we include 11 VOCs [acetaldehyde, benzene, 1,3-butadiene, carbon tetrachloride, ethylbenzene, formaldehyde, hexane, methanol, methyl chloride (chloromethane), toluene, and xylene (mixed isomers)] and four HMs (lead, manganese, mercury, and nickel). Maps of each pollutant in NC, where each pollutant is presented as a percentile in relation to NC, are presented in Supplementary Figs. S1–S3.
Our socioeconomic data are from the American Community Survey (ACS), which has been conducted by the U.S. Census Bureau since 2005. We used the 1-year 2014 survey to align with the NATA data timeframe (45). We selected variables from the ACS that have been found to reflect both socioeconomic advantage and disadvantage (2016; ref. 40). Palumbo and colleagues show that SES factors can be distinguished by latent advantage/disadvantage groups, providing a more flexible and nuanced way to describe SES as compared with other continuous metrics such as the neighborhood SES deprivation index (e.g., NSES). We downloaded the data using the R package, tidycensus. The SES Disadvantage variables (40) include data on households such as the proportion of households with a single head of household or a female head of household and the proportion of people renting in a given census tract. We also report the proportion of crowded households, that is, houses having more than one person per room, which is generally considered a sign of overcrowded housing (46). We also include the proportion of people without a vehicle, with phone service, below the poverty line, relying on public assistance, and unemployed. The SES Disadvantage latent group includes data on race/ethnicity via the proportion of non-Hispanic black individuals in a given census tract. The SES Advantage variables are related to education and profession (40) and include the proportion of males and females in a professional occupation, and the proportion of people with less than a high school education. Maps of each SES variable in NC are provided in the Supplementary Materials (Supplementary Figs. S4 and S5).
LCMM
We employ LCMMs to investigate how pollution and SES interact by identifying latent classes underlying the observed data. Our model has three latent variables: SES Disadvantage, SES Advantage, and Toxins and Pollutants. We allowed covariance between the three latent variables. Our LCMM is illustrated in Fig. 1.
To preprocess the data, we applied square root and log transformations to the SES and pollution data, respectively, to achieve normality for each group of variables because the SES variables are proportions, whereas the pollution data are continuous.
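The two transformations can be sketched as follows. This is a minimal illustration and not the authors' preprocessing code: the function names and input values are hypothetical, and the use of the natural logarithm is an assumption because the text does not state the base.

```python
import math

def sqrt_transform(proportions):
    """Square-root transform for SES variables recorded as proportions in [0, 1]."""
    return [math.sqrt(p) for p in proportions]

def log_transform(concentrations):
    """Natural-log transform for strictly positive pollutant concentrations.

    The base of the logarithm is an assumption; zero concentrations would
    need separate handling because log(0) is undefined.
    """
    return [math.log(c) for c in concentrations]

# Hypothetical values for three census tracts (not taken from the ACS or NATA).
renting = [0.04, 0.25, 0.64]
diesel_pm = [0.1, 1.0, 10.0]

renting_t = sqrt_transform(renting)
diesel_pm_t = log_transform(diesel_pm)
```

The square root stretches small proportions apart, while the logarithm compresses the long right tail that is typical of concentration data.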
We ran several versions of the model, each with different numbers of classes for each latent variable. For assessing model fit, typical metrics to consider include model convergence, fit statistics such as BIC, AIC, and entropy, and class membership, that is, the percentage of census tracts classified under each level. To assess uncertainty in the best-fit model, we examine the probability of a census tract being assigned to a given level; if this probability is high, we can conclude that the assignment has low uncertainty.
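For reference, the fit statistics named above have standard definitions that can be sketched in a few lines. The code below uses the textbook formulas for AIC and BIC and the relative entropy commonly reported by latent class software; it is an illustration with hypothetical inputs, not output from the authors' models.

```python
import math

def aic(log_likelihood, n_params):
    """Akaike information criterion: 2k - 2 ln L (lower is better)."""
    return 2 * n_params - 2 * log_likelihood

def bic(log_likelihood, n_params, n_obs):
    """Bayesian information criterion: k ln n - 2 ln L (lower is better)."""
    return n_params * math.log(n_obs) - 2 * log_likelihood

def relative_entropy(posteriors):
    """Relative entropy in [0, 1]; 1 means every unit is classified with certainty.

    posteriors: one row per unit, each row holding class probabilities that sum to 1.
    """
    n = len(posteriors)
    k = len(posteriors[0])
    total = 0.0
    for row in posteriors:
        for p in row:
            if p > 0:  # the limit of p * ln p as p -> 0 is 0
                total -= p * math.log(p)
    return 1.0 - total / (n * math.log(k))

# Hypothetical posterior class probabilities for two units and two classes.
certain = [[1.0, 0.0], [0.0, 1.0]]
ambiguous = [[0.5, 0.5], [0.5, 0.5]]

entropy_certain = relative_entropy(certain)
entropy_ambiguous = relative_entropy(ambiguous)
```

Because BIC multiplies the parameter count by ln n rather than 2, it penalizes additional classes more heavily than AIC whenever n exceeds about 7 observations.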
We used MPlus [v 1.5 (1)] to run the analysis, which uses maximum likelihood estimation to estimate the level of exposure for each census tract. We called MPlus from within R using the R package MplusAutomation, which facilitates preprocessing and postprocessing of data from these models. To evaluate the presence of racial disparities in our data, we ran an ANOVA of non-Hispanic black population density versus the low/high SES Disadvantage, SES Advantage, and pollution latent classes using the R function, lm().
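To make the ANOVA step concrete, the sketch below computes a one-way F statistic for a single two-level factor in plain Python. It is a simplified stand-in for the analysis described above, which was run in R with lm() across three latent variables; the function name and the data are hypothetical.

```python
def one_way_anova_f(groups):
    """Return (F statistic, between-group df, within-group df) for a one-way ANOVA."""
    all_values = [v for g in groups for v in g]
    n = len(all_values)
    k = len(groups)
    grand_mean = sum(all_values) / n
    group_means = [sum(g) / len(g) for g in groups]

    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means))
    # Variation of the observations around their own group mean.
    ss_within = sum((v - m) ** 2 for g, m in zip(groups, group_means) for v in g)

    df_between = k - 1
    df_within = n - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within

# Hypothetical non-Hispanic black population proportions for tracts in the
# low and high SES Disadvantage classes (illustrative values only).
low_disadvantage = [0.10, 0.12, 0.08, 0.11]
high_disadvantage = [0.30, 0.35, 0.28, 0.33]

f_stat, df_between, df_within = one_way_anova_f([low_disadvantage, high_disadvantage])
```

A P value would then come from the upper tail of the F distribution with these degrees of freedom, which requires a statistics library and is omitted here.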
We also conducted K-means clustering on the SES and pollution variables as a comparison. We tested up to 15 clusters using the kmeans() function in R. We evaluated each model using NbClust(), which utilizes several metrics to choose the number of clusters that best fits the data. We computed AIC and BIC to compare the fit of the best K-means model with the LCMM.
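For readers unfamiliar with the comparison method, the following is a minimal sketch of K-means using Lloyd's algorithm. It is not the R kmeans() implementation used in the analysis; the helper names, the fixed starting centers, and the data are assumptions chosen to keep the example deterministic.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def assign(points, centers):
    """Index of the nearest center for each point."""
    return [min(range(len(centers)), key=lambda j: sq_dist(p, centers[j])) for p in points]

def kmeans(points, centers, n_iter=100):
    """Lloyd's algorithm from given starting centers.

    Returns (labels, centers, within-cluster sum of squares).
    """
    centers = [tuple(c) for c in centers]
    for _ in range(n_iter):
        labels = assign(points, centers)
        new_centers = []
        for j, old in enumerate(centers):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centers.append(tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                new_centers.append(old)  # keep the center of an empty cluster
        if new_centers == centers:  # converged: centers no longer move
            break
        centers = new_centers
    labels = assign(points, centers)
    wss = sum(sq_dist(p, centers[lab]) for p, lab in zip(points, labels))
    return labels, centers, wss

# Hypothetical two-variable observations forming two well-separated groups.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
labels, centers, wss = kmeans(points, [points[0], points[3]])
```

Each observation receives exactly one label, so the procedure yields hard assignments with no accompanying membership probability.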
Results
We present summary statistics of each SES and pollution variable for a given census tract in NC as compared with the United States in Table 1. For both SES Advantage and SES Disadvantage, the NC averages are higher than the national average. For the pollutant variables, the opposite is true of most VOCs (1,3-butadiene, benzene, ethylbenzene, hexane, methyl chloride, and xylenes), diesel PM, lead compounds, mercury compounds, and nickel compounds. However, acetaldehyde, formaldehyde, and methanol are higher in NC than the national average.
The best-fit model for NC comprised mixtures of two levels of Pollutants, two levels of SES Advantage, and two levels of SES Disadvantage, for a total of eight levels of exposure. We modeled the SES latent variables with two levels to reflect the findings in the Palumbo study (40). We ran several models with varying levels within the Pollutants latent variable. We found that including more than two levels resulted in nonconvergence in the model, so we limited our final model to include only two levels for the pollutant latent variable.
We estimated the mean of each observed variable under each of the eight levels and conclude that the levels or classes within each latent variable can be characterized as low and high. The latent variable profiles are displayed in Fig. 2 as bar plots, where the pollution variables are expressed as percentiles and the SES variables are expressed as percentages. In the low pollution level, the majority of the pollutants fall around the 20th to 25th percentile, except for mercury, nickel, and methyl chloride (40th–50th percentile). In the high pollution level, almost every pollutant at least doubles; however, methyl chloride hardly changes, and mercury and nickel only increase roughly to the 60th percentile.
In areas with low SES Advantage, on average, the percentage of males and females in professional occupations is estimated to be low (<5%), while the percentage of people who did not graduate from high school is estimated to be close to 20%. Conversely, in high SES Advantage areas, the estimated percentage of people in professional occupations rose to roughly 10% for both males and females, while the percentage of the population without a high school degree dropped to around 5%. Examining the SES Disadvantage latent variable, the model estimates that the following variables are found in higher percentages in the high SES Disadvantage class than in the low SES Disadvantage class: no vehicle, public assistance, unemployed, crowded housing, renting, single householder, female householder, below poverty line. Phone service was the only SES Disadvantage variable that barely differed between classes: it is close to 100% in the high disadvantage areas and only slightly higher in the low disadvantage areas.
We predicted the level of exposure for each census tract, based on the highest predicted class probability for each tract. Overall, 34.1% of the census tracts have high SES Disadvantage, 66.3% have low SES Advantage, and 59.2% have high mixtures of air pollutants. Our analysis shows that rural areas are sometimes observed under the highest levels of pollution, in addition to metropolitan areas. Areas of high SES Advantage are concentrated in urban/suburban and coastal areas, while SES Disadvantage dominates the eastern half of the state as well as many urban areas. Table 2 displays the number of census tracts in each level, as well as the number of those that are rural. In the “Census tracts” column, the values down the column all sum to the total number of census tracts in NC (N = 2,174). In the “Rural census tracts” column, each row is the percentage of census tracts in a given exposure level that are classified as rural. For example, in exposure Level 1, 162 of the 216 census tracts are rural, that is, 75%.
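The assignment rule and the rural percentage calculation can be illustrated with a short sketch. The posterior probabilities below are hypothetical and the function name is an assumption; only the counts 162 and 216 come from the text.

```python
def modal_level(probabilities):
    """Return (level, probability) for the most probable level; levels are numbered from 1."""
    best = max(range(len(probabilities)), key=lambda i: probabilities[i])
    return best + 1, probabilities[best]

# Hypothetical posterior probabilities over the eight exposure levels for two tracts.
tract_a = [0.02, 0.00, 0.05, 0.01, 0.03, 0.02, 0.02, 0.85]
tract_b = [0.55, 0.01, 0.30, 0.02, 0.02, 0.05, 0.01, 0.04]

level_a, prob_a = modal_level(tract_a)
level_b, prob_b = modal_level(tract_b)

# Share of rural tracts in Level 1, using the counts quoted in the text.
rural_share_level_1 = 162 / 216
```

In this example, the first tract is assigned with far more confidence (0.85) than the second (0.55), which is why the size of the winning probability is worth keeping alongside the assigned level.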
These patterns are visible in Fig. 3, which displays the estimated exposure level for each census tract in NC. The rural areas of the eastern half of the state are dominated by Level 1 (high disadvantage, low advantage, and low pollution), which comprises 9.9% of the census tracts. We did not observe any census tracts in Level 2 (high disadvantage, high advantage, and low pollution) because the combination of high disadvantage and high advantage is rare (40).
Level 3 (low disadvantage, low advantage, and low pollution) dominates the rural areas of the western half of the state and comprises 25.2% of the census tracts. The most favorable level, Level 4 (low disadvantage, high advantage, and low pollution), comprises only 5.7% of the tracts and is found only in suburban areas throughout the state and on the coast.
Level 5 (low disadvantage, high advantage, and high pollution) is the most prevalent class with 25.8% membership and is concentrated in urban areas across NC. Level 6 (low disadvantage, low advantage, and high pollution) is in a few urban areas in the central part of the state, as well as suburban mountain and inner coastal areas, comprising 9.2% of the census tracts. The smallest nonempty exposure level is Level 7 (high disadvantage, high advantage, and high pollution) with only 2.2% membership and is found exclusively in urban centers throughout NC. The most toxic class, Level 8 (high disadvantage, low advantage, and high pollution), can be found in 22% of the census tracts and is largely located in the urban areas of central NC as well as scattered in rural mountain, inner coastal, and coastal areas. This is one of the few levels to associate high pollution with rural areas.
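As a consistency check, the level percentages reported above can be summed to recover the overall percentages for each latent variable. The sketch below uses only figures stated in the text; the data structure and function name are illustrative.

```python
# (disadvantage, advantage, pollution, percentage of census tracts) for each level,
# as reported in the text.
levels = {
    1: ("high", "low", "low", 9.9),
    2: ("high", "high", "low", 0.0),
    3: ("low", "low", "low", 25.2),
    4: ("low", "high", "low", 5.7),
    5: ("low", "high", "high", 25.8),
    6: ("low", "low", "high", 9.2),
    7: ("high", "high", "high", 2.2),
    8: ("high", "low", "high", 22.0),
}

def marginal(position, value):
    """Sum the percentages of all levels whose profile has `value` at `position`."""
    return round(sum(row[3] for row in levels.values() if row[position] == value), 1)

high_disadvantage = marginal(0, "high")  # Levels 1, 2, 7, 8
low_advantage = marginal(1, "low")       # Levels 1, 3, 6, 8
high_pollution = marginal(2, "high")     # Levels 5, 6, 7, 8
total = round(sum(row[3] for row in levels.values()), 1)
```

The sums come to 34.1%, 66.3%, and 59.2%, matching the overall figures for high SES Disadvantage, low SES Advantage, and high pollution, and the eight levels together account for 100% of tracts.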
To assess the uncertainty of these predictions, we examined the probabilities of a census tract being assigned to a given level. In the Supplementary Materials (Supplementary Fig. S6), we present a box plot which shows the distribution of assignment probabilities for seven of the eight latent variable levels. Level 2 is not represented because in our analysis, no census tracts were assigned to Level 2. In the case of all levels except Level 7, the median is above 0.80, leading us to believe that there is low uncertainty in the results of our analysis for these levels. The median for Level 7 is approximately 0.75.
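The summary described here, the median assignment probability within each level, can be sketched as follows. The assigned levels and probabilities are hypothetical and the function name is an assumption; the 0.80 threshold is the one used in the text.

```python
from statistics import median

def median_modal_probability(assignments):
    """Median assignment probability per level.

    assignments: list of (assigned level, probability of that level) pairs, one per tract.
    """
    by_level = {}
    for level, prob in assignments:
        by_level.setdefault(level, []).append(prob)
    return {level: median(probs) for level, probs in sorted(by_level.items())}

# Hypothetical assignments for eight tracts (illustrative values only).
assignments = [
    (1, 0.91), (1, 0.84), (1, 0.88),
    (7, 0.70), (7, 0.75), (7, 0.81),
    (8, 0.95), (8, 0.89),
]

summary = median_modal_probability(assignments)
# Flag levels whose median assignment probability exceeds 0.80.
low_uncertainty = {level: m > 0.80 for level, m in summary.items()}
```

A level that no tract is assigned to simply does not appear in the summary, which mirrors the absence of Level 2 from the box plot.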
Our results also provided insights into economic disparities based on race/ethnicity. Areas with higher SES Disadvantage had significantly higher black population density (P < 0.001). Similarly, black population density was higher in areas with higher pollution (P < 0.001).
For the K-means clustering analysis, we found that four clusters best fit the data. In the Supplementary Materials (Supplementary Figs. S7 and S8), we provide visualizations of the cluster characteristics and the census tract membership in NC. There was a similar pattern of unfavorable SES and pollution levels in urban areas and favorable levels in suburban areas. However, with only four clusters, the clustering from K-means is less nuanced than that from the LCMM. On the basis of AIC and BIC (lower values are better), the LCMM was a better fit to the data (BIC/AIC for LCMM = −57,345.93/−57,874.57, BIC/AIC for K-means = 29,646.9/29,019.16).
Model extension for cancer disparities
We have proposed an initial framework to link SES and environmental exposures. In Fig. 4, we demonstrate how this model can be extended as a means to delineate cancer outcomes disparities. Our future work involves utilizing this novel framework to determine potential modifiable factors that could be employed to reduce disparities in cancer outcomes. Each oval in Fig. 4 represents a domain of interest, measured or determined by a set of variables, similar to that shown in Fig. 1.
Clinical care and access could be measured at the individual level, using variables such as type of insurance, whether the patient has a primary care physician, and proximity to care. Individual behavior could include physical activity, smoking, alcohol, and other measures that infer higher (or lower) risk. The Biology and Genetics/Ancestry domain could incorporate high-dimensional data, or a smaller set of markers thought to influence cancer outcomes. For high-dimensional data, a variable selection process would also have to be included. Other extensions, such as spatiotemporal correlation, will be critical to deepening our understanding of how exposures over time contribute to the development of disease. We continue to work with this ultimate framework in mind, and will share techniques, software, and methods as they are extended. For different cohorts and diseases that we study, some domains are critical, while others may not be, so we will not always utilize every domain area. But we consider this framework a starting point for thinking about how to model cancer outcomes disparities. We recognize that there can be many differential exposures, levels of access, and behaviors, which may all contribute to the outcomes disparities we see. For example, in studies of breast cancer, we have utilized a biological domain of the subtyping biomarkers, and we incorporate an association with race as well as one with SES. It is well-known that African American women more frequently have triple-negative breast cancer, and this likelihood may vary with SES. This modeling framework and methodology provides the ability to flexibly model such relationships.
Discussion
To our knowledge, this work provides the first framework for an exposure model based on a broad range of SES measures and environmental toxins. The model can also be extended to incorporate cohorts (or trials) with biological measures, clinical care and access, behavioral measures, and health outcomes. LCMMs have the advantage of providing a flexible approach for fitting and testing theoretical relationships. Furthermore, these structures and relationships can be illustrated via diagrams (e.g., Figs. 1 and 4), which, though representative of complex statistical models, can be used to bridge knowledge gaps and further research within multidisciplinary teams of experts by providing accessible visualizations of potential hypotheses to be tested. Also, LCMMs have the added benefit of allowing for uncertainty quantification, an essential modeling feature when working with results that have the potential to inform cancer treatment. We illustrate this advantage by comparing LCMM with a widely used clustering technique, K-means clustering, which not only underperformed in terms of fitting the data, but also lacked any measures of uncertainty in assigning census tracts to exposure clusters/classes. In addition, the LCMM approach can accommodate measures that are typically highly correlated, such as race and SES, while taking that correlation into account. This allows us to interpret the impact of SES, which is informed by race, rather than assuming the effect of race is “removed” when controlling for SES. Furthermore, many potential exposures can be incorporated, the mixture of exposures can be tested, and, when planning interventions, areas that exhibit higher risk populations can be easily identified and prioritized.
Gray and colleagues (48) used modeled predicted surfaces to examine the relationship between air pollution exposure, race, and measures of SES in NC. They considered only PM2.5 and O3 as the pollutants, assessed poverty, education, and income as SES area level measures from the 2000 census, and also considered the neighborhood deprivation index (41). Similar to our results, they found that PM2.5 was higher in areas with lower SES, higher deprivation, and higher minority population density. Weaver and colleagues (26) examined the joint impact of SES and PM2.5 on cardiovascular outcomes. They utilized a hierarchical clustering approach to identify SES groupings, where clusters 1 and 2 exhibited high proportions of black population, impoverished, nonmanagerial populations, unemployed, and single parent households; while cluster 3 was urban, with high proportions of college degree, and low poverty, nonmanagerial, and unemployed. These groupings have some similarities to our Disadvantage and Advantage latent classes. In their model, all of the SES variables are in a single domain with multiple levels, while we consider two domains with multiple levels and utilize the association between these two domains. They also noted higher impact of PM2.5 on cardiovascular outcomes in the lower SES areas. Brochu and colleagues (49), in a PM10 and PM2.5 model in the Northeastern United States, also found that annual PM was consistently and significantly higher in census tracts with lower socioeconomic position, based on cost of living adjusted median household income. In a review of North American studies of criterion air pollutants and SES (50), most studies found a similar relationship of higher air pollutants in areas with lower SES. Some exceptions to this general pattern existed. For example, in New York City, in a borough-specific analysis, the Bronx, Staten Island, and areas of Manhattan exhibited the opposite pattern. In Los Angeles, PM2.5 and O3 levels were similar across SES, but other pollutants were higher for lower SES.
Previous research has shown that when a latent class level has too few observations, it is not meaningful to include in the model (51). Often in latent models, the AIC, BIC, and entropy continue to improve as the number of latent variable classes increases, and not necessarily because of better fit. This can be mitigated through cross-validation (52). In our analysis, each individual latent variable has sufficient membership in both the high and low levels (illustrated in Supplementary Fig. S9 of the Supplementary Materials). We did see sparse membership once we considered the combination of the high and low levels of SES and pollution. In fact, the probability of assignment to Level 2 is nonzero, but it is small in many areas. We are therefore not surprised that no census tracts in NC were assigned to Level 2. We anticipate that if the analysis were repeated for a larger geographic region, we might see a nonzero, but still small, number of tracts assigned to Level 2. Our earlier work identified a small proportion of zip codes assigned to the combination of High Advantage, High Disadvantage (40). We think this particular combination represents neighborhoods that are in flux, and longitudinally could represent gentrification or decline.
We recognize that the NATA and area-level SES data used do not represent actual individual exposures. However, it is recognized that associations with health outcomes from such area-level measures can be informative. Widely used for health research including cancer studies (53–58), the NATA and ACS data are largely useful for large-scale time trend and spatial analysis and are limited in their usefulness for analysis on a fine spatial and temporal scale. This may be mitigated using datasets with finer spatial and temporal resolution (e.g., EPA Federal Reference Method Air Quality Monitors or the Community Multiscale Air Quality Modeling System) and/or by interpolating the data to achieve higher resolutions (59). There is ongoing research to develop causal models of environmental exposure on health outcomes, and in that setting, it may be critical to utilize specific exposure data. Future work is needed to extend such causal models to the MPE and SES framework we propose for cancer disparities.
Disclosure of Potential Conflicts of Interest
T. Hyslop reports grants from NCI during the conduct of the study and personal fees from AbbVie outside the submitted work (not related to submitted work). No potential conflicts of interest were disclosed by the other authors.
Authors' Contributions
A. Larsen: Data curation, software, formal analysis, validation, writing–original draft, writing–review and editing. V. Kolpacoff: Data curation, formal analysis, writing–review and editing. K. McCormack: Conceptualization, software, writing–original draft, writing–review and editing. V. Seewaldt: Conceptualization, resources, writing–review and editing. T. Hyslop: Conceptualization, resources, validation, methodology, writing–original draft, project administration, writing–review and editing.
Acknowledgments
This work is supported by the NCI of the NIH under award number NCI R01CA220693 (awarded to V. Seewaldt and T. Hyslop, supporting A. Larsen, V. Seewaldt, and T. Hyslop), and this material is also based upon work supported by the National Science Foundation under grant number DGE 1545220 (awarded to C. Gunsch, supporting K. McCormack).