Abstract
Background: Few older adults achieve recommended physical activity levels. We conducted a “neighborhood environment-wide association study (NE-WAS)” of neighborhood influences on physical activity among older adults, analogous, in a genetic context, to a genome-wide association study.
Methods: Physical Activity Scale for the Elderly (PASE) and sociodemographic data were collected via telephone survey of 3,497 residents of New York City aged 65 to 75 years. Using Geographic Information Systems, we created 337 variables describing each participant's residential neighborhood's built, social, and economic context. We used survey-weighted regression models adjusting for individual-level covariates to test for associations between each neighborhood variable and (i) total PASE score, (ii) gardening activity, (iii) walking, and (iv) housework (as a negative control). We also applied two “Big Data” analytic techniques, LASSO regression, and Random Forests, to algorithmically select neighborhood variables predictive of these four physical activity measures.
Results: Of all 337 measures, proportion of residents living in extreme poverty was most strongly associated with total physical activity [−0.85; (95% confidence interval, −1.14 to −0.56) PASE units per 1% increase in proportion of residents living with household incomes less than half the federal poverty line]. Only neighborhood socioeconomic status and disorder measures were associated with total activity and gardening, whereas a broader range of measures was associated with walking. As expected, no neighborhood measures were associated with housework after accounting for multiple comparisons.
Conclusions: This systematic approach revealed patterns in the domains of neighborhood measures associated with physical activity.
Impact: The NE-WAS approach appears to be a promising exploratory technique. Cancer Epidemiol Biomarkers Prev; 26(4); 495–504. ©2017 AACR.
See all the articles in this CEBP Focus section, “Geospatial Approaches to Cancer Control and Population Sciences.”
Introduction
Physical activity, bodily movement produced by skeletal muscles, prevents colon and breast cancer, even after accounting for differences in body size (1–3). Physical activity also prevents obesity, which itself causes cancer at 13 organ sites including breast, colon, kidney, pancreas, and esophagus (4). There is also evidence that physical activity can reduce fatigue, improve quality of life, and improve survival among cancer patients (5–8). Among cancer survivors, particularly colorectal and breast cancer survivors, obesity increased more rapidly from 1997 to 2014 than in the general population (9). Roughly 20% of older adults (individuals above the age of 65) living in the United States are cancer survivors (10); promoting physical activity among older adults may thus substantially improve cancer outcomes in the population. The Centers for Disease Control and Prevention recommends that older adults engage in at least two and a half hours a week of moderate aerobic activity (or equivalent amounts of vigorous activity), in part to prevent cancer (11). Yet in practice, few older adults meet physical activity recommendations (12, 13).
One thread of physical activity promotion, including among older adults, has sought to identify and remove neighborhood contextual barriers to physical activity as part of everyday life (14–16). For example, safe and attractive sidewalks may make older adults choose to walk rather than drive for certain short trips, thereby increasing their activity levels (17, 18). However, despite considerable qualitative evidence supporting the concept that supportive neighborhood environments encourage physical activity in older adults (19, 20) and the development of numerous theoretical frameworks (14, 21–23) exploring the conceptual links between neighborhood environments and physical activity, quantitative evidence confirming that specific neighborhood features support specific activities has been inconsistent (24).
One reason for inconsistency may be the difficulty of objectively operationalizing the neighborhood constructs described qualitatively. For example, in interviews, older adults frequently indicate that they do not like to walk in their neighborhoods if they feel that they may be targeted for crime (19). However, measuring neighborhood crime risk is extremely difficult (25). Whereas one study may operationalize neighborhood crime as reported neighborhood crime in an administrative area such as a county or zip code (26), another may ask subjects to report their perceptions of neighborhood safety (27), and a third might use geostatistical techniques such as ordinary kriging to estimate prevalence of crime within a buffer around subject homes (28). If the third measure most accurately reflected the true impediment to activity, both the zip code-based and self-report–based studies would likely underestimate true effects due to nondifferential measurement error (29).
A complementary reason for quantitative inconsistency is the sheer number of ways researchers characterize neighborhoods using “Big Data,” geographic information systems (GIS) tools and spatial analysis techniques (30–32). There is no perfect definition of neighborhood; the physical space connoted by “neighborhood” is subjective and may vary as older adults lose or recover functional capacity (33–35). As a result, researchers define neighborhood in many different ways, including radial buffers (the area within a given distance “as the crow flies” from the subject's home address), network buffers (the area reachable by walking a given distance on city streets), and activity spaces [the area a subject was observed to travel through over some period during which the subject wore a Global Positioning System (GPS) device; refs. 34, 36]. These differences in neighborhood specification can strongly affect study validity (37, 38).
Molecular epidemiologists have developed analytic approaches to explore similar groups of measures systematically (39). In a genome-wide association study (GWAS), “agnostic” analytic approaches are used to search the whole genome for the strongest genetic associations, which are then assumed to be the best candidates for subsequent research (40). Recently, the agnostic paradigm has been extended to high-throughput “-omic” fields focused on biomarker discovery and environmental sciences, where it has been termed an “environment-wide association study” (EWAS; refs. 40–43). Although agnostic approaches have limited causal interpretation, their systematic nature also enables straightforward replication (39, 44). As neighborhood datasets increasingly resemble GWAS and EWAS datasets, it follows that neighborhood research may similarly benefit by drawing on agnostic research paradigms.
Taking explicit inspiration from the GWAS and EWAS approaches, we propose and illustrate the neighborhood environment-wide association study (NE-WAS) design. We analyze each neighborhood measure's ability to predict total physical activity and each of three specific subtypes of activity, controlling for individual characteristics. Next, we examine the robustness of identified associations to variation in neighborhood definition. Finally, in exploratory analyses, we consider two machine learning algorithms (LASSO and Random Forest) that select variables algorithmically and have been used in epigenetic analyses as potential components of an NE-WAS.
Materials and Methods
Subjects and setting
We use data from NYCNAMES-II, a study of 3,497 residents of New York City aged 65–75 years. Sampling and recruitment for NYCNAMES-II have been described previously (45). Briefly, subjects were recruited by phone in 2011 from a telephone list purchased from InfoUSA. Phone numbers had previously been geocoded to census tracts, and numbers were selected to ensure geographic coverage of the city. Final survey weights were then raked [i.e., sample weights were recomputed so the weighted sample approximates the joint distribution of known population characteristics (46)] to New York City population estimates from the 2006–2010 American Community Survey for gender and race/ethnicity and from 2010 Census estimates for educational attainment and borough of residence. For this analysis, we used only data from the baseline interview. A total of 279 subjects (8%) reported poor health and were excluded, leaving 3,218 subjects for the primary analysis. Descriptive statistics for the cohort are reported in Table 1.
Characteristic | Full cohort (N = 3,497), N (%) | Full cohort, sample-weighted % | Fair or better health (N = 3,218), N (%) | Fair or better health, sample-weighted % |
---|---|---|---|---|
Age | ||||
65–68 | 1,045 (33) | 34 | 956 (33) | 33 |
69–71 | 664 (21) | 20 | 608 (21) | 21 |
72–75 | 1,442 (46) | 46 | 1,335 (46) | 47 |
Sex | ||||
Female | 2,094 (60) | 58 | 1,907 (59) | 57 |
Male | 1,403 (40) | 42 | 1,311 (41) | 43 |
Race/Ethnicity | ||||
Non-Hispanic white | 1,800 (51) | 47 | 1,701 (53) | 48 |
Non-Hispanic black | 1,073 (31) | 26 | 974 (30) | 26 |
Hispanic | 245 (7) | 14 | 209 (6) | 13 |
Other | 379 (11) | 12 | 334 (10) | 12 |
Educational attainment | ||||
Less than high school | 673 (19) | 32 | 570 (18) | 30 |
Completed high school | 949 (27) | 29 | 870 (27) | 29 |
Some college | 627 (18) | 15 | 570 (18) | 15 |
Completed college | 1,248 (36) | 24 | 1,208 (38) | 25 |
Household income | ||||
Less than $20,000 | 1,279 (37) | 40 | 1,097 (34) | 37 |
$20,000–40,000 | 842 (24) | 24 | 790 (25) | 24 |
$40,000–80,000 | 745 (21) | 21 | 711 (22) | 22 |
More than $80,000 | 631 (18) | 16 | 620 (19) | 17 |
Health status | ||||
Excellent | 645 (18) | 17 | 645 (20) | 18 |
Good | 1,523 (44) | 42 | 1,523 (47) | 46 |
Fair | 1,050 (30) | 33 | 1,050 (33) | 36 |
Poor | 279 (8) | 9 | – (–) | – |
Activity measures | ||||
Walked 5–7 days in the last week | 1,346 (38) | 42 | 1,154 (36) | 39 |
Gardened in the last week | 710 (20) | 23 | 686 (21) | 24 |
Performed heavy housework in the last week | 1,872 (54) | 54 | 1,793 (56) | 57 |
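The raking step mentioned under Subjects and setting can be illustrated with a minimal sketch of iterative proportional fitting. This is not the weighting code used for NYCNAMES-II; the function names, the toy respondents, and the target shares are hypothetical, and the sketch assumes every category contains at least one respondent.

```python
def weighted_share(weights, labels):
    """Weighted share of each category label (shares sum to 1)."""
    total = sum(weights)
    shares = {}
    for w, lab in zip(weights, labels):
        shares[lab] = shares.get(lab, 0.0) + w / total
    return shares


def rake(weights, groups, targets, n_iter=200, tol=1e-12):
    """Iterative proportional fitting (raking) of survey weights.

    weights: starting weight for each respondent
    groups:  margin name -> list of category labels, one per respondent
    targets: margin name -> {category: target population share}
    """
    w = list(weights)
    for _ in range(n_iter):
        max_change = 0.0
        for margin, labels in groups.items():
            current = weighted_share(w, labels)
            # Rescale each respondent so this margin matches its target.
            for i, lab in enumerate(labels):
                factor = targets[margin][lab] / current[lab]
                max_change = max(max_change, abs(factor - 1.0))
                w[i] *= factor
        if max_change < tol:  # every margin already matches its target
            break
    return w


# Hypothetical toy sample: five respondents, two margins.
sex = ["F", "F", "F", "M", "M"]
region = ["A", "B", "A", "A", "B"]
raked = rake(
    [1.0] * 5,
    {"sex": sex, "region": region},
    {"sex": {"F": 0.5, "M": 0.5}, "region": {"A": 0.5, "B": 0.5}},
)
```

Each pass adjusts one margin at a time, so the margins are matched in turn rather than the full cross-classification being fitted directly.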
Measures
During the baseline interview, each subject reported his or her age, educational attainment, health status, income, race/ethnicity, and sex. Because we theorized that the neighborhood environment should only be able to influence physical activity among subjects whose health permitted outdoor physical activity, we excluded those who reported poor health from the primary analysis.
We assessed past-week physical activity using sixteen items derived from the Physical Activity Scale for the Elderly (PASE; refs. 47–49), a validated survey tool designed for use with older adults. The PASE instrument assesses past-week physical activity in a number of domains, including strengthening exercises, sports and recreation, walking, gardening, and housework. The PASE score is a linear combination of all sixteen items that reflects total physical activity [in one validation study, r = 0.68 with a doubly labeled water assessment of physical activity, considered the gold standard metabolic measure of energy expenditure (50); refs. 47–49]. PASE scores in the included subjects ranged from 0 to 296 and were slightly right-skewed, with a mean of 84 and a median of 77. Thirty-nine percent of the subjects reported daily walking, 23% reported gardening, and 57% reported doing heavy housework.
Neighborhood measures.
During the baseline interview, each subject reported their home address. We geocoded these addresses using GeoSupport, a New York City–specific geocoding tool released by New York's Department of City Planning. Ninety-six percent of addresses were successfully geocoded to a rooftop; the remaining 4% were assigned the age 64–73 population weighted centroid of the reported ZIP code as a home location. For each subject, we defined the residential neighborhood as the land area reachable by city streets within a given distance of the geocoded home location, an area referred to in neighborhood research as a network buffer (51–53). Our primary analysis used 0.25 km network buffers, which contain the area accessible within a 5-minute walk for a 70-year-old woman with a comfortable gait speed within two SDs of the mean (54).
For each subject, we compiled 337 unique neighborhood measures. Specifically, demographic and economic characteristics came from the 2006–2010 American Community Survey. Urban form measures were constructed from TIGER/Line shapefiles describing street layout, the New York Metropolitan Transit Authority's ridership reports, and a LiDAR scan of the city (55, 56). Crime and disorder measures were compiled from a measure of crime risk developed by ESRI, Inc., municipal street cleanliness records, a systematic virtual audit using Google Street View imagery, and homicide incident locations as reported by the New York City Police Department to the New York Times (57–60). Parks measures, including boundaries and park cleanliness, were obtained from The New York City Department of Parks and Recreation (61). Pedestrian and cyclist injury counts were compiled from records initially recorded by reporting police officers (62).
We categorized measures into bins according to the aspect of the urban environment each captured (Table 2). These bins are analogous to chromosomes in genomic studies, genes in epigenomic studies, or “class groupings” in an environment-wide association study (42).
Domain | Number of measures | Data source(s) | Examples |
---|---|---|---|
Demographics and housing characteristics | 121 | American Community Survey | Population density, % white alone, % boys ages 10–14 |
Education, employment, and income | 102 | American Community Survey | % College grad, % in labor force, % in food prep sector |
Urban form and walkability | 50 | American Community Survey, New York State Accident Location Information Service Line Layer, NYC Transit Authority | % walk to work, density of 4-way intersections, bus stop density, % of roadbed covered by tree canopy |
Crime and disorder | 35 | Esri Crime Risk, Google Street View, New York Times Homicide Map, NYC Sanitation Department Report Cards | Weighted average risk of larceny, Mean neighborhood disorder, % filthy streets |
Parks | 5 | New York City Department of Parks and Recreation | % of land area in large parks |
Pedestrian safety | 24 | New York State Department of Transportation and New York City Police Department | Cyclist injury density in the 1990s, Pedestrian fatality density in the 2000s |
Some measures, such as density of vehicle collisions involving pedestrians, were right-skewed. To be consistent with best practices in agnostic studies and maximize comparability between environmental predictors, we transformed such skewed predictors before analysis (42). To assess skew for each measure, we visually compared a histogram of the measure and the measure after log-transformation. We conducted a preliminary investigation of an automated procedure to decide whether to log-transform measures, detailed in Supplementary Fig. S1. We retained log-transformed measures for analysis in place of untransformed measures if the log-transformed measure visually appeared closer to a normal distribution than its untransformed analogue.
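As an illustration only, an automated version of the log-transform decision could compare the skewness of a measure before and after transformation. The sketch below is not the procedure described in Supplementary Fig. S1, whose details are not given here; the function names, the offset of 1 added before taking logs (so that zero densities are defined), and the example values are all assumptions.

```python
import math


def skewness(values):
    """Moment-based skewness: third central moment over variance ** 1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    if m2 == 0:
        return 0.0  # a constant measure has no skew
    return m3 / m2 ** 1.5


def prefer_log(values, offset=1.0):
    """True if log(value + offset) is less skewed than the raw values."""
    logged = [math.log(v + offset) for v in values]
    return abs(skewness(logged)) < abs(skewness(values))


# Hypothetical right-skewed collision densities for eight neighborhoods.
collision_density = [0, 1, 1, 2, 3, 5, 8, 40]
use_log = prefer_log(collision_density)
```

A rule of this kind replaces the visual comparison of histograms with a single number, at the cost of ignoring features such as bimodality that a histogram would reveal.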
For every pair of perfectly correlated measures (for example, proportion of occupied homes occupied by owners and proportion of occupied homes occupied by renters have a correlation coefficient of −1), we excluded one of the measures.
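The exclusion of perfectly correlated measures can be sketched as follows. The helper names, the tolerance, and the example values are hypothetical, and the sketch assumes no measure is constant (a constant measure has an undefined correlation).

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def drop_perfectly_correlated(measures, tol=1e-9):
    """Keep the first measure of every perfectly correlated pair.

    measures: dict mapping measure name -> list of neighborhood values.
    Returns the names of the retained measures in their original order.
    """
    kept = []
    for name, values in measures.items():
        redundant = any(
            abs(abs(pearson(values, measures[other])) - 1.0) < tol
            for other in kept
        )
        if not redundant:
            kept.append(name)
    return kept


# Hypothetical values: renter share is exactly 1 minus owner share.
example = {
    "pct_owner_occupied": [0.2, 0.5, 0.9, 0.4],
    "pct_renter_occupied": [0.8, 0.5, 0.1, 0.6],
    "population_density": [10.0, 3.0, 7.0, 8.0],
}
retained = drop_perfectly_correlated(example)
```

Which member of a redundant pair is kept does not affect the strength of association, only the sign of the reported coefficient.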
Supplementary Table S1 includes a complete list of all 337 contextual measures used in the final analysis, including their underlying data sources, and whether or not we log-transformed the measure before analysis.
Statistical analysis
Following the GWAS analytic approach, we used linear regression to model PASE score from each neighborhood environment variable individually. In addition, we used logistic regression analogously to estimate the strength of association between each variable and engaging in each of three activities: (i) daily walking, (ii) gardening, and (iii) “heavy housework” (e.g., vacuuming, sweeping, moving furniture; ref. 49). We hypothesized on the basis of prior literature that daily walking would be associated with measures of urban form (63, 64) and that, because lack of outdoor space poses a barrier to gardening in many but not all neighborhoods of New York City, gardening would be associated with housing characteristics (65). We selected heavy housework as a “negative control” (66). That is, because we believe that neighborhood conditions do not affect participation in heavy housework, we can interpret either of two findings as evidence of residual confounding: that a large number of neighborhood exposures are associated with housework, or that the pattern of exposures associated with housework resembles the pattern predictive of other activity measures. Whereas gardening and heavy housework are dichotomous measures in the PASE instrument, daily walking is not; we considered those who reported 5–7 days of past-week walking to be daily walkers. All regression analyses incorporated survey weights and controlled for each individual's age, race/ethnicity, educational attainment, income, and home size. After Bonferroni correction for 337 comparisons, we had adequate sample size to detect a change of 3.5 PASE units (roughly 10 minutes of walking per day) for each SD change in neighborhood exposure with a probability of 0.73 (67).
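The multiple-comparison step can be made concrete with a short sketch. It assumes that a P value has already been obtained for each neighborhood measure from its own covariate-adjusted model; the function names and the example P values are hypothetical.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold under Bonferroni correction."""
    return alpha / n_tests


def significant_measures(p_values, alpha=0.05):
    """Measures whose P value falls below the Bonferroni threshold.

    p_values: dict mapping measure name -> P value from that measure's
    own regression model. Returns (name, p) pairs, smallest P first.
    """
    threshold = bonferroni_threshold(alpha, len(p_values))
    hits = [(name, p) for name, p in p_values.items() if p < threshold]
    return sorted(hits, key=lambda item: item[1])


# With 337 comparisons, the per-measure threshold is about 0.00015.
newas_threshold = bonferroni_threshold(0.05, 337)

# Hypothetical P values for three measures (threshold is 0.05 / 3 here).
toy_hits = significant_measures({"a": 0.001, "b": 0.04, "c": 0.5})
```

Because the threshold shrinks in proportion to the number of measures tested, adding measures to an NE-WAS raises the bar each individual measure must clear.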
Next, to explore how buffer size affects the pattern of types of measures correlated with activity, we repeated all regression analyses with each measure computed for a 1-km network buffer around the subjects' home address. We then compared the estimated regression parameters for measures at 1 km to regression parameters for measures at 0.25 km to identify instances where neighborhood measures were more predictive at one neighborhood scale as opposed to the other.
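One way to compare the two buffer scales, sketched below, is to take the difference in −log10 P for the same measure at each scale and to flag coefficients whose sign differs between scales. We assume, without confirmation from the supplement, that this mirrors the quantities reported for the buffer comparison; the function names and example values are hypothetical.

```python
import math


def neg_log10(p):
    """Transform a P value so that larger numbers mean stronger evidence."""
    return -math.log10(p)


def scale_difference(p_small_buffer, p_large_buffer):
    """Difference in -log10 P between buffer scales.

    Positive values mean the measure is more strongly associated with
    the outcome at the larger buffer; negative values favor the smaller.
    """
    return neg_log10(p_large_buffer) - neg_log10(p_small_buffer)


def sign_changes(coef_small, coef_large):
    """Names of measures whose regression coefficient changes sign."""
    return [
        name
        for name in coef_small
        if coef_small[name] * coef_large[name] < 0
    ]


# Hypothetical measure: P = 0.01 at 0.25 km and P = 0.0001 at 1 km.
diff = scale_difference(1e-2, 1e-4)
flipped = sign_changes({"a": 0.5, "b": -0.2}, {"a": 0.3, "b": 0.1})
```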
Finally, to explore how developments in computer science might inform future NE-WASs, we investigated two algorithmic approaches (LASSO regression and Random Forest) that select the neighborhood characteristics most predictive of physical activity in multivariable models (68, 69). The variables remaining in models tuned using cross-validation can be regarded as the most informative variables (70). We selected these algorithms because they are widely used, software is readily available, and analogous approaches are increasingly common in GWAS studies searching for gene–gene interactions (71) and in epigenetic studies (41). These explorations are detailed in the online supplement.
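To show how LASSO regression selects variables, the following minimal sketch implements coordinate descent with soft-thresholding for the objective (1/(2n))·||y − Xb||² + λ·||b||₁. It is an illustration, not the software used in this study: it assumes centered predictors and outcome (so no intercept), no all-zero predictor columns, and a fixed penalty λ rather than one tuned by cross-validation. All names and data are hypothetical.

```python
def soft_threshold(rho, lam):
    """Shrink rho toward zero by lam; values within [-lam, lam] become 0."""
    if rho > lam:
        return rho - lam
    if rho < -lam:
        return rho + lam
    return 0.0


def lasso_coordinate_descent(X, y, lam, n_sweeps=100):
    """Coordinate descent for (1/(2n)) * ||y - X b||^2 + lam * ||b||_1.

    X: list of rows, each a list of centered predictor values.
    y: list of centered outcome values.
    """
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    for _ in range(n_sweeps):
        for j in range(p):
            rho = 0.0  # covariance of predictor j with the partial residual
            z = 0.0    # sum of squares of predictor j
            for i in range(n):
                fit_without_j = sum(
                    X[i][k] * beta[k] for k in range(p) if k != j
                )
                rho += X[i][j] * (y[i] - fit_without_j)
                z += X[i][j] ** 2
            beta[j] = soft_threshold(rho / n, lam) / (z / n)
    return beta


# Hypothetical data: the outcome depends strongly on the first predictor
# (slope 2.0) and only weakly on the second (slope 0.1).
X = [[1.0, 1.0], [-1.0, 1.0], [1.0, -1.0], [-1.0, -1.0]]
y = [2.1, -1.9, 1.9, -2.1]
selected = lasso_coordinate_descent(X, y, lam=0.5)
unpenalized = lasso_coordinate_descent(X, y, lam=0.0)
```

With the penalty applied, the weak predictor's coefficient is set exactly to zero while the strong predictor's coefficient is shrunk; this is the sense in which variables "remain" in a tuned model.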
Missing data
Relatively few data were missing on physical activity (maximum of 1.8% on any PASE item) or demographic covariates (16.2% were missing household income data; no other items were missing for more than 10% of subjects), and no data were missing on any neighborhood covariate. Nonetheless, to address potential non-response biases, we performed all survey-weighted regressions on each of 5 datasets where missing values were imputed using multivariate sequential regression as implemented by IVEWARE (72) with all available survey responses included in the prediction model. Following standard practice, we combined the estimates resulting from these models using Rubin's rules (73).
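Rubin's rules for pooling a coefficient across imputed datasets can be written compactly. In the sketch below, the function name and the five example estimates are hypothetical and are not results from this study.

```python
import math


def rubin_pool(estimates, variances):
    """Pool one coefficient across m imputed datasets with Rubin's rules.

    estimates: the coefficient estimated in each imputed dataset
    variances: the squared standard error from each imputed dataset
    Returns (pooled estimate, pooled standard error).
    """
    m = len(estimates)
    pooled = sum(estimates) / m
    within = sum(variances) / m  # average sampling variance
    between = sum((q - pooled) ** 2 for q in estimates) / (m - 1)
    total = within + (1 + 1 / m) * between
    return pooled, math.sqrt(total)


# Hypothetical coefficients and variances from five imputed datasets.
est, se = rubin_pool([1.0, 1.2, 0.8, 1.1, 0.9], [0.04] * 5)
```

The between-imputation term widens the pooled standard error to reflect uncertainty about the imputed values themselves, which analysis of a single completed dataset would ignore.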
Sensitivity analyses
To test our regression results' sensitivity to the assumption that neighborhood characteristics were not important for those who reported poor health, we repeated the primary analysis with the full cohort of 3,497 subjects.
Software
All analyses used 64-bit R for Windows version 3.2.3.
Results
Characteristics associated with physical activity
In linear regression models controlling for individual covariates, measures of neighborhood resident socioeconomic position and physical disorder were most strongly associated with total physical activity. Specifically, the proportion of residents living in households with incomes less than half the poverty level was the most strongly associated with PASE score, with an estimated decrease of 0.85 (95% CI: 0.56–1.14) PASE score units per 1% increase in proportion of residents living in households with incomes less than half the federal poverty line. This association size is equivalent to 10 minutes less of daily walking per 4% increase in proportion of residents living below half the federal poverty line. The remaining four of the top five measures included three other measures of resident socioeconomic position, all showing correlations between higher numbers of higher-income residents and more physical activity among the NYCNAMES-II subjects, and one disorder measure, showing well-maintained windows, a marker of building upkeep, to be correlated with more activity (Table 3). After Bonferroni correction, no measure of resident demographics, parks, urban form, or pedestrian and cyclist safety was associated with PASE score.
Result | PASE score | Gardening | Walking daily | Heavy housework |
---|---|---|---|---|
Count of measures that remained significant after Bonferroni correction | 5 (1.5%) | 33 (9.8%) | 49 (14.4%) | 0 (0.0%) |
Top 5 statistically significant neighborhood measures (by P value of coefficient) | People living in households with incomes less than half the poverty level (−) | People living in households with incomes less than half the poverty level (−) | Proportion of residents with 60- to 90-minute travel time to work (−) | — |
Top 5 (continued) | People living in households with incomes below the poverty line (−) | Neighborhood Physical Disorder (−) | Broken windows in HVS survey (−) | — |
Top 5 (continued) | No problems with windows in HVS survey (+) | People living in households with incomes below the poverty line (−) | Proportion of adult population with at least some college education (+) | — |
Top 5 (continued) | People living in households with incomes more than twice the poverty level (+) | People living in households above twice the poverty line (+) | Proportion of working adult population commuting by car, truck, or van (−) | — |
Top 5 (continued) | People living in households with incomes between half and three-quarters of the poverty level (−) | People living in households with any interest, dividend, or rental income (+) | Proportion of adult population working in professional or management industries (+) | — |
NOTE: All analyses control for subject age, race/ethnicity, educational attainment, household income, gender, and home type.
Abbreviation: HVS, New York City Housing and Vacancy Survey
Logistic regression analyses, each focused on a single type of activity, identified many more significant neighborhood correlates than analyses targeting total activity (Fig. 1, given in color in the online supplement). Measures of high neighborhood socioeconomic status were the strongest predictors of gardening, whereas a wide range of neighborhood characteristics predicted walking 5–7 days in the previous week (Table 3). Reassuringly, no neighborhood measures were predictive of heavy housework after Bonferroni correction.
Neighborhood characteristics and buffer size
Using 1-km rather than 0.25-km buffers led to more variables being significant after Bonferroni correction, but measures at neither buffer scale were uniformly more strongly correlated with total PASE score (Fig. 2, given in color in the online Supplementary Data). Of the 337 neighborhood measures available at both scales, regression coefficients for associations with PASE scores changed signs for 38 (11%) of the neighborhood measures. None of the measures that changed sign were nominally significant at a P value of 0.05 at either neighborhood buffer scale. There was no clear pattern as to which neighborhood measures were better correlated with PASE scores at the two scales (Table 4).
Scale | Measure | Difference in −log10 P value |
---|---|---|
Better at 0.25-km scale | Proportion of population living in households (+) | 2.24 |
Better at 0.25-km scale | Proportion of population who are naturalized citizens (−) | 1.68 |
Better at 0.25-km scale | Proportion of population living in households with income below half the poverty level (−) | 1.57 |
Better at 0.25-km scale | Density of 3-way intersections (+) | 1.28 |
Better at 0.25-km scale | Proportion of vacant housing units offered for rent (−) | 1.27 |
Better at 1.0-km scale | Proportion of households with incomes between 25K and 30K (−) | 2.77 |
Better at 1.0-km scale | Proportion of adult residents with a professional degree or more (+) | 2.56 |
Better at 1.0-km scale | Proportion of households with incomes between 30K and 35K (−) | 2.56 |
Better at 1.0-km scale | Proportion of family households living below the poverty line with a male householder and no children under age 18 (−) | 2.45 |
Better at 1.0-km scale | Proportion of total population aged 10 to 14 years | 2.42 |
NOTE: Plus or minus indicates the direction of association between the neighborhood measure and PASE score.
Algorithmic variable selection
The best-fitting LASSO regression models for each outcome incorporated roughly the same number of neighborhood variables as were significant in conventional regression for the respective outcome (3 for total PASE score, 45 for gardening, 22 for walking, and 0 for heavy housework). However, the specific variables that were selected were highly sensitive to model tuning parameters, limiting substantive interpretability. The variables ranked as highly important in Random Forests frequently all belonged to the same neighborhood measurement domain (e.g., all variables important for gardening were related to housing characteristics) but did not match the variables selected by LASSO regression or conventional regression. The final LASSO model explained 10.1% of variation in PASE score; in contrast, the Random Forest explained −3.6% of PASE variation, suggesting the final model predicted PASE scores less accurately than simply assigning every subject the mean score. Results from algorithmic variable selection are discussed in more detail in the online Supplementary Data.
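A negative share of explained variation is possible when the share is computed as one minus the ratio of prediction error to outcome variance. The sketch below assumes that definition, which we have not confirmed against the supplement; the function name and example values are hypothetical.

```python
def explained_variance(observed, predicted):
    """1 - (residual sum of squares / total sum of squares).

    Equals 1 for perfect predictions, 0 when the predictions are no
    better than the observed mean, and is negative when they are worse.
    """
    n = len(observed)
    mean = sum(observed) / n
    ss_total = sum((o - mean) ** 2 for o in observed)
    ss_resid = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    return 1.0 - ss_resid / ss_total


# Hypothetical outcome values and three sets of predictions.
scores = [1.0, 2.0, 3.0, 4.0]
perfect = explained_variance(scores, scores)
mean_only = explained_variance(scores, [2.5, 2.5, 2.5, 2.5])
reversed_fit = explained_variance(scores, [4.0, 3.0, 2.0, 1.0])
```

Under this definition, a value slightly below zero on held-out or out-of-bag predictions indicates a model that has fitted noise rather than signal.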
Sensitivity analyses
Subjects who were excluded from the primary analysis owing to poor health were more likely to be female, to be racial/ethnic minorities, to have lower household incomes, and to be less educated (Table 1). However, the sensitivity analysis conducted using the full cohort identified the same top 5 measures, albeit in a different order (Supplementary Table S2).
Discussion
In this analysis, we explored a novel agnostic NE-WAS approach to selecting the neighborhood measures most strongly associated with total physical activity, as well as specifically with walking, gardening, and housework. In our study, the most strongly predictive measure of total physical activity was proportion of residents living in households with incomes below half the federal poverty threshold, equating to $11,056 for a family of four. Neighborhood socioeconomic and disorder measures were most associated with total activity. Socioeconomic measures also strongly predicted gardening, whereas measures of commute distance and commute times were more relevant for walking. As expected, no neighborhood measures significantly predicted heavy housework. Overall, the NE-WAS approach appears promising, and our findings suggest NE-WAS may be appropriate for other neighborhood-associated health conditions as well, such as obesity (64), breast cancer (74), or cardiac arrest (75).
More neighborhood environment measures were significantly associated with our specific outdoor activity measures, walking and gardening, than with physical activity as a whole, while no neighborhood measures significantly predicted heavy housework. Our findings thus serve as empirical support for prior calls, typically made on theoretical grounds, to consider influences on differing domains of activity separately (24, 25, 76–78).
There are several interpretations for our finding that neighborhood socioeconomic measures were more consistently associated with activity measures than neighborhood characteristics (e.g., access to parks) that have more direct theoretical relevance to specific forms of outdoor activity. It may be that residents of higher socioeconomic status neighborhoods have used their resources to shape neighborhoods to offer more support for different forms of activity among older adults (79), including dedicated outdoor space that supports gardening and well-maintained sidewalks or amenities such as benches and public restrooms that older adults cite as necessary supports for walking (80). A complementary explanation is that residual confounding due to incomplete control for individual socioeconomic position is responsible for this association. Older adults of higher socioeconomic position are typically more physically active (81, 82) and tend to live in neighborhoods with other individuals of high socioeconomic position. Our analysis controlled for household income and educational attainment, but neither fully captures socioeconomic position among older adults (83).
While the NE-WAS approach explicitly draws an analogy between genetics and neighborhoods, we caution, as others have, that there are vital differences between genomes and modifiable exposures like neighborhoods (44). Most importantly, unlike the SNPs that act as independent variables in a GWAS, wherein there are few if any correlations between polymorphisms on separate chromosomes, the correlation structure underlying neighborhood characteristics is strong, complex, and potentially causally circular (84). However, in proteomics and metabolomics research, wherein measured molecules do show strong and complex intercorrelations, identified molecules are considered to be markers of a process rather than causes of the process and a separate scientific approach, pathway analysis, has developed to integrate knowledge from agnostic analyses to develop and test causal hypotheses (85, 86). There are analogous systems science–derived integrations of knowledge in neighborhood research (e.g., ref. 87), although such approaches are still in their infancy. Nonetheless, we anticipate that in this sense the NE-WAS approach is more akin to an omics approach than a GWAS: the value of the NE-WAS stems not from a precise estimate of the causal effect of some neighborhood characteristic but rather from the ability to systematically identify targets for future exploration and to reveal reproducible patterns in associations across cohorts (39).
While this analysis addressed neighborhood factors correlated with physical activity, the NE-WAS approach could be applied to explore other contextual research questions. For example, NE-WAS might help to systematically explore the appropriate measures and geographic levels at which to understand the disparities in cancer incidence (35). NE-WAS may also be of value for standardizing neighborhood definitions or selecting the spatial scale at which a neighborhood construct is most relevant (24). Finally, NE-WAS with pooled or multisite studies would allow a systematic assessment of correlation patterns between geographic regions, quantifying variation in susceptibility to neighborhood environment risk factors (88).
In general, algorithmic variable selection resulted in substantively uninterpretable models. We caution, however, that our investigation was limited to two techniques that were designed for prediction rather than for explanation and happen to embed variable selection within the predictive modeling approach. Future NE-WASs might explore not only more aggressive exclusion of collinear measures but also other algorithmic approaches, including multifactor dimension reduction (89), a technique that explicitly aims to identify interactions that might not be uncovered by conventional analytic approaches. Further analysis of correlation structures among neighborhood predictors, which was out of scope for this analysis, may shed further light on the most appropriate analytic techniques for future NE-WASs (90, 91).
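One simple way a future analysis might exclude collinear measures is a greedy filter on pairwise Pearson correlations. The sketch below is illustrative only and is not the procedure used in this study; the 0.9 cutoff, the measure names, and the values are all hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def prune_collinear(measures, threshold=0.9):
    """Keep a measure only if |r| with every already-kept measure is below threshold."""
    kept = []
    for name, values in measures.items():
        if all(abs(pearson(values, measures[k])) < threshold for k in kept):
            kept.append(name)
    return kept

# Invented values for five hypothetical neighborhoods.
measures = {
    "poverty_rate": [5, 10, 15, 20, 25],
    "extreme_poverty_rate": [2, 5, 7, 10, 12],  # nearly collinear with poverty_rate
    "park_area": [3, 1, 4, 1, 5],
}

print(prune_collinear(measures))  # ['poverty_rate', 'park_area']
```

The result of a greedy filter depends on the order in which measures are considered, so in practice the ordering (for example, by theoretical priority) would need to be chosen deliberately.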
This study had several notable strengths. First, the relatively large and population-based sample of older adults residing in a very well-characterized urban environment allowed for relatively precise estimates of associations between neighborhood characteristics and activity outcomes. Second, the use of a survey measure that included items assessing types of activity allowed us to incorporate analyses that target activity measures representing a range of hypothesized susceptibility to neighborhood influence (24, 77). Third, without a theoretical basis to guide variable selection, agnostic studies are at risk of identifying strongly confounded variables. In this light, our “negative control” finding that no environmental factors were associated with heavy housework after Bonferroni correction provides some, albeit incomplete, evidence against residual confounding (66).
However, our results should be viewed in light of several limitations. First, the 337 neighborhood measures we analyzed comprise only the measures of New York City's urban environment that were readily available to the research team. Future NE-WASs might productively undertake a more systematic exploration of neighborhood measures used in the literature to select a comprehensive set of measures to study, potentially incorporating neighborhood measures of no theoretical relevance as further negative controls. Second, we compared only two neighborhood definitions, 0.25-km network buffers and 1.0-km network buffers. It has been repeatedly noted that no single definition captures the construct of a neighborhood (24, 36); indeed, the meaning of neighborhood may be different for different measures, in different contexts, and for different subgroups (76). Future NE-WASs might broadly compare more buffer sizes for a single measure. Third, while New York City comprises a range of urban environments, including pockets of sidewalk-free post-war “sprawl,” it nonetheless contains a much more pedestrian-oriented environment than the United States as a whole, and a population at greater extremes of the socioeconomic spectrum. It may be productive to compare results from this NE-WAS to future NE-WASs conducted in environments more representative of the contexts in which most American older adults reside. Fourth, as in any agnostic study, our substantive findings should be viewed with caution until replicated in other cohorts (92). Fifth, the Bonferroni correction we used to account for multiple comparisons is likely overly conservative; future NE-WASs might explore estimating the false discovery rate instead (93). Finally, as in most neighborhood studies, we were unable to determine whether statistical adjustment for participant race/ethnicity and socioeconomic status fully accounts for residential self-selection (94).
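To illustrate why a Bonferroni correction can be more conservative than false discovery rate control, the sketch below compares it with the Benjamini-Hochberg step-up procedure, one common way to control the false discovery rate. The source does not specify which false discovery rate method future studies would use, so this choice is an assumption, and the P values are invented.

```python
def bonferroni_reject(p_values, alpha=0.05):
    """Reject each hypothesis whose P value is at most alpha / m."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]

def benjamini_hochberg_reject(p_values, alpha=0.05):
    """Benjamini-Hochberg step-up: find the largest rank k such that
    p_(k) <= (k / m) * alpha, then reject the k smallest P values."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    max_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            max_rank = rank
    reject = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= max_rank:
            reject[idx] = True
    return reject

# Invented P values for five hypothetical neighborhood measures.
p_values = [0.001, 0.008, 0.012, 0.020, 0.30]

print(sum(bonferroni_reject(p_values)))          # 2
print(sum(benjamini_hochberg_reject(p_values)))  # 4
```

In this toy example the Bonferroni threshold (0.05 / 5 = 0.01) retains two measures, whereas the Benjamini-Hochberg procedure retains four, at the cost of tolerating a controlled proportion of false discoveries among them.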
In conclusion, the NE-WAS is a promising approach for empirically identifying the neighborhood measures most strongly related to measurable outcomes, including not only cancer-preventing behaviors such as physical activity, but also health outcomes such as cancer incidence. In this NE-WAS, neighborhood socioeconomic characteristics were more consistently associated with physical activity than measures of crime, parks, and pedestrian safety. We anticipate performing NE-WASs in other cohorts, other geographic contexts, and with other outcomes, to determine the replicability of the approach, to improve handling of multicollinearity, and to deepen substantive findings (39).
Disclosure of Potential Conflicts of Interest
No potential conflicts of interest were disclosed.
Authors' Contributions
Conception and design: S.J. Mooney, J.R. Beard, A.G. Rundle
Development of methodology: S.J. Mooney, G.J. Kennedy, J.R. Beard, A.G. Rundle
Acquisition of data (provided animals, acquired and managed patients, provided facilities, etc.): M. Cerdá, G.J. Kennedy, J.R. Beard, A.G. Rundle
Analysis and interpretation of data (e.g., statistical analysis, biostatistics, computational analysis): S.J. Mooney, J.R. Beard, A.G. Rundle
Writing, review, and/or revision of the manuscript: S.J. Mooney, S. Joshi, M. Cerdá, G.J. Kennedy, J.R. Beard, A.G. Rundle
Administrative, technical, or material support (i.e., reporting or organizing data, constructing databases): S. Joshi, A.G. Rundle
Study supervision: M. Cerdá, A.G. Rundle
Acknowledgments
The authors thank Thelma Mielenz, Alfred Neugut, Shuang Wang, and Ryan Demmer for their helpful comments on an earlier version of this work.
Grant Support
S.J. Mooney was supported by National Institute of Child Health and Human Development (NICHD) grant 5T32HD057822-07. S. Joshi, M. Cerdá, J.R. Beard, G.J. Kennedy, and A.G. Rundle were supported by National Institute for Mental Health grant 5R01MH085132-05.
The costs of publication of this article were defrayed in part by the payment of page charges. This article must therefore be hereby marked advertisement in accordance with 18 U.S.C. Section 1734 solely to indicate this fact.