Next Article in Journal
Two Types of Size-Biased Samples When Modeling Extreme Phenomena
Previous Article in Journal
Comparing the Relative Efficacy of Generalized Estimating Equations, Latent Growth Curve Modeling, and Area Under the Curve with a Repeated Measures Discrete Ordinal Outcome Variable
 
 
Font Type:
Arial Georgia Verdana
Font Size:
Aa Aa Aa
Line Spacing:
Column Width:
Background:
Article

A Spatial–Temporal Bayesian Model for a Case-Crossover Design with Application to Extreme Heat and Claims Data

by
Menglu Liang
1,*,†,‡,
Zheng Li
2,‡,
Lijun Zhang
3,‡ and
Ming Wang
3,‡
1
Department of Epidemiology and Biostatistics, School of Public Health, University of Maryland, College Park, MD 20742, USA
2
Novartis Pharmaceuticals, New Jersey, NJ 07936, USA
3
Department of Population and Quantitative Health Sciences, Case Western Reserve University School of Medicine, Cleveland, OH 44106, USA
*
Author to whom correspondence should be addressed.
Current address: Department of Epidemiology and Biostatistics, School of Public Health, University of Maryland, Baltimore Ave, College Park, MD 20740, USA.
These authors contributed equally to this work.
Stats 2024, 7(4), 1379-1391; https://doi.org/10.3390/stats7040080
Submission received: 14 October 2024 / Revised: 3 November 2024 / Accepted: 7 November 2024 / Published: 19 November 2024

Abstract

:
Epidemiological approaches for examining human health responses to environmental exposures in observational studies frequently address confounding by employing advanced matching techniques and statistical methods grounded in conditional likelihood. This study incorporates a recently developed Bayesian hierarchical spatiotemporal model within a conditional logistic regression framework to capture the heterogeneous effects of environmental exposures in a case-crossover (CCO) design. Spatial and temporal dependencies are modeled through random effects incorporating multivariate conditional autoregressive priors. Flexible frailty structures are introduced to explore strategies for managing temporal variables. Parameter estimation and inference are conducted using a Monte Carlo Markov chain method within a Bayesian framework. Model fit and optimal model selection are evaluated using the deviance information criterion. Simulations assess and compare model performance across various scenarios. Finally, the approach is illustrated with workers’ compensation claims data from New York and Florida to examine spatiotemporal heterogeneity in hospitalization rates related to heat prostration.

1. Introduction

Environmental epidemiological studies frequently use aggregate health data to evaluate the short-term health effects of environmental exposures. For instance, daily counts of deaths, emergency department (ED) visits, and hospitalizations across a city have been correlated with short-term exposure to extreme temperatures [1,2].
The use of aggregate health data is an example of an ecological design, where the actual exposure corresponds to the distribution of individual-level exposure within the at-risk population [3,4]. However, it is impractical for large population-based studies to measure individual environmental exposures directly. Therefore, exposure surrogates that summarize individual-level exposure (e.g., the mean) are used. This approach can lead to exposure misclassification, resulting in biases and incorrect characterization of the uncertainties associated with health effect estimates [5,6].
In extreme climate conditions, extreme heat events (EHEs) are often the primary focus of study. Over short-term periods, there can be significant variability in exposure within the population due to (1) spatial differences in EHEs within the study area and (2) individuals spending time in various environments with different outdoor and indoor conditions. To tackle the first challenge, spatial–temporal models have been developed to estimate pollutant concentrations with fine spatial resolution, ensuring complete spatial–temporal coverage [7,8]. These estimates are then used to calculate short-term exposure values, which are linked to aggregated health outcome data across spatial units of interest (e.g., counties). Studies that account for spatial variability in exposure have been found to yield higher health effect estimates compared to those using data from sparse, stationary monitors, especially for pollutants with high spatial variability (e.g., traffic-related pollutants) [9,10].
Conversely, there has been less focus on addressing both location-based exposure heterogeneity and varying critical time windows simultaneously in ecological studies [11]. Factors contributing to spatial variation can include characteristics of the exposure (e.g., spatial differences in measurement error) and/or attributes of the study population that may be either excluded from the analysis or included as imperfect proxies for key covariates or confounders (e.g., race, ethnicity, and education level) [11,12]. To address spatial variation, Warren and Langlois [13,14] employed geostatistical methods to introduce spatial correlation between lagged regression parameters for different discrete spatial regions, defining correlation based on the distance between region centroids. However, this approach may be inadequate when spatial regions are irregularly shaped or vary significantly in size [15]. Furthermore, using distance-based spatial correlation models can increase computational complexity as the number of unique locations grows, due to the reliance on a Gaussian process with a spatial covariance function and the need to invert large matrices during each iteration of the model-fitting algorithm (typically achieved using Markov chain Monte Carlo sampling techniques) [16].
The study of short-term exposure effects using aggregated data is not a new concept. Some studies allow for heterogeneous effects through stratification or covariates adjustment [17,18], while others directly address these effects using the case-only approach developed by Armstrong [19]. In research on extreme temperatures, stratification has traditionally focused on demographic characteristics such as age, race/ethnicity, and sex. The case-only approach has also been applied to factors like chronic conditions, socioeconomic status, and various census tract characteristics [20,21,22,23]. We propose an extension to the widely used case-crossover (CCO) study design, where heterogeneous exposure–response relationships are modeled using a newly developed Bayesian spatial–temporal model [24]. This approach allows the model to flexibly learn and accommodate potentially complex heterogeneous exposure–response relationships during the fitting process. This study aims to (1) enhance the conventional case-crossover (CCO) design by incorporating a Bayesian spatiotemporal model to capture heterogeneous exposure–response relationships, thus enabling the analysis of environmental exposure effects across varied locations and times; (2) deepen understanding of differential impacts of environmental exposures on health outcomes by accounting for spatial and temporal dependencies; and (3) flexibly adapt to complex, location-specific exposure–response dynamics, allowing for the detection of population-specific variability that traditional models may miss.
The structure of the paper is as follows: the Methods section introduces the notation and provides a detailed description of the spatiotemporal models employed, along with the Bayesian algorithm used for parameter estimation and inference. The Simulation Study section describes the comprehensive simulations conducted to evaluate our approach. The Data Application section presents results from the motivating example using workers’ compensation claims data from Florida and New York. Lastly, the Discussion section provides concluding remarks and suggestions for future research.

2. Methods

2.1. Study Design

Case-crossover (CCO) designs are commonly used to analyze health data in environmental epidemiology studies when only case data, such as hospital visits, are available [25]. In a CCO design, each case is matched with a set of controls within a designated referent window, forming a stratum [26]. A widely used strategy for selecting this window is the time-stratified approach, which matches each case to three or four other dates within the same calendar month, ensuring the same day of the week [27]. This approach assumes cases are independent and occur infrequently enough that an individual would not experience the event more than once within the referent window. The time-stratified design is both localizable and ignorable, providing unbiased estimates of regression coefficients when conditional logistic regression [28] is used, enabling the analysis of exposure impacts over time. Figure 1 shows the study design of our real data example.
Our claims data include the date and county information for each claim, using injury codes to identify hospitalizations due to heat prostration. While the New York workers’ compensation claims dataset includes individual-level identifiers, the Florida dataset lacks this information. During the study period, a single individual may have multiple claims, and each year has unique individual identifiers. Additional details about the datasets are provided in the Supplementary Materials. Our data modeling approach is as follows:
Y i j k | , β ind Bernoulli { p i j k } , log { p i j k 1 p i j k } = x i j k T β .
For the k th claim ( k = 1 , , K i j ) from the i th county ( i = 1 , , I ) at the j th time point ( j = 1 , , J ) , x i j k denotes a p × 1 vector of covariates for regression, which may include time-dependent or time-varying factors of interest. The vector β , also p × 1 , contains the regression coefficients associated with the covariates x i j k .

2.2. Bayesian Spatial–Temporal Model

To capture spatial heterogeneity, we introduce a spatial random effect into the model, defined as
l o g { p i j k 1 p i j k } = x i j k T β + ω i .
Using the approach outlined in the spatial–temporal Bayesian accelerated failure time models [24], we assume that the random effect ω i follows a normal distribution N ( 0 , τ 2 ) . A conditional autoregressive (CAR) distributionBesag [29] for ω i is then proposed.
ω i | ω i N i i m i i ω i i i ω i i , 1 i i m i i τ 2 ,
where ω i = { ω 1 , , ω i 1 , ω i + 1 , , ω I } . It is important to note that m i i is defined as 1 if county i is adjacent to county i ; otherwise, m i i = 0 . The joint density function of ω = ( ω 1 , , ω I ) T can be expressed as follows:
Pr ( ω ) τ n exp 1 2 τ 2 ω T ( D ω C ) ω ,
where D ω = diag i 1 m 1 i , , i I m I i is a diagonal matrix, with the i th diagonal element representing the total number of counties adjacent to county i. Additionally, C = ( m i i ) I × I denotes the adjacency matrix for all counties included in the study.
Let k denote the referent window that includes observation times j for claim k. Given the observed data D = { ( Y i j k , k ) , i = 1 , , I , j = 1 , , J , k = 1 , , K i j } and spatial random effects ω , and operating under the case-crossover assumption that j Y i j k = 1 k = 1 , , K i j , the conditional data likelihood is expressed as
L ( D | ω ) = i = 1 I j = 1 J k = 1 K i j f ( Y i j k | ω i ) = i = 1 I j = 1 J k = 1 K i j exp ( x i j k T β + ω i + ϵ i j k ) j exp ( x i j k T β + ω i + ϵ i j k ) ,
where ϵ i j k represents the error term in the regression model, and we assume ϵ i j k N ( 0 , σ ϵ 2 ) .
Given that the workers’ compensation claims data encompass patients enrolled across different years (i.e., temporal cohorts), we extended the application of the multivariate conditional autoregressive (MCAR) prior in our model for further analysis [24,30,31]. Specifically, we utilized the modeling approach proposed by Wang [24], incorporating an additional random effect vector for the temporal cohorts in the i th county, denoted as γ i = ( γ i 1 , , γ i J ) T . Consequently, the spatial–temporal model can be expressed as follows:
log { p i j k 1 p i j k } = x i j k T β + z i k T ξ + η i T γ i + ω i ,
where z ik = ( z i 1 k , , z i J k ) T is a J × 1 vector with z i j k serving as a dichotomous temporal cohort indicator (1 for yes, 0 for no) for the k th claim from the i th county; ξ = ( ξ 1 , , ξ J ) T represents the fixed temporal effect; and η i = ( η i 1 , , η i J ) T , where η i j is a binary indicator covariate associated with the j th year-specific random effect γ i j .
Let Δ i = ( ω i , γ i T ) T , and we assume
Δ i N ( 0 , Λ 1 )
Δ i | Δ i M V N i i m i i Δ i i i m i i , 1 i i m i i Λ 1 ,
where Λ represents the hyper-parameters in the MCAR prior. Therefore, the joint density function of Δ = vec Δ 1 , , Δ I is
Pr ( Δ ) | Λ | 1 2 exp 1 2 Δ T ( D ω C ) Λ Δ .
Similarly, the conditional likelihood of the observed data D = { Y i j k , i = 1 , , I , j = 1 , , J , k = 1 , , K i j } is
L ( D | Δ ) = i = 1 I j = 1 J k = 1 K i j f ( p i j k ) ,
where f ( p i j k ) is the conditional density function, for the k th claim from the i th county in the j th temporal cohort. Denoting μ ( x i j k ) = x i j k T β + z i k T ξ + η i T γ i + ω i , we have
f ( p i j k ) = e x p ( μ ( x i j k ) ) j Ω e x p ( μ ( x i j k ) ) .
We consider the non-informative priors for β , ξ , and μ , where
Pr ( β ) 1 , Pr ( ξ ) 1 , μ 1 .
Also, with regard to random effects Δ i , we select a conjugated prior for Λ with Λ Wishart p , R . To ensure the prior remains vague, p can represent the dimension of Λ , and R can be arbitrarily specified as a diagonal matrix D i a g { 100 , , 100 } I × I . Further details on the Bayesian inference using the MCAR prior can be found in the Appendix A.

2.3. Model Selection Criteria

We utilize the deviance information criterion (DIC) to select the best candidate model. Let θ represent the complete parameter space of the model. Spiegelhalter et al. [32] introduced the following DIC within the Bayesian framework, which integrates the likelihood and posterior distributions:
D I C = 2 p D + D ( θ ¯ ) ,
where θ ¯ is the posterior mean of θ . It is important to note that D ( θ ¯ ) = 2 log L ( θ ¯ ) + C represents the deviance of the model based on our posterior estimates, with C being a constant in the DIC. Additionally, p D = D ¯ D ( θ ¯ ) is the difference between the mean of the deviance under the posterior distribution and the deviance evaluated at the posterior mean.
Inspired by the analysis of the workers’ compensation claims data, we considered the following two candidate models to simultaneously account for both temporal and spatial heterogeneity:
  • Model 1 (M1): log { p i j k 1 p i j k } = x i j k T β + ξ z i j k + γ i z i j k + ω i ;
  • Model 2 (M2): log { p i j k 1 p i j k } = x i j k T β + z i k T ξ + η i T γ i + ω i .
In Model 1 (M1), z i j k represents the year in which the claims were filed, treating the year as a continuous variable. In Model 2 (M2), z i k consists of indicator variables for the year when the claims were made. M1 incorporates a random slope to account for the temporal effect, defined as Δ i = { ω i , γ i } . In contrast, M2 includes a spatial–temporal intercept for each cohort and county, expressed as Δ i = { ω i , γ i j , i = 1 , , I ; j = 1 , , J } . We selected the optimal model for analyzing the workers’ compensation claims data based on the DIC, with a lower DIC indicating a better model fit. The procedures for parameter estimation, inference, and model diagnostics were implemented in R and utilized C functions. The relevant code is available at the GitHub link provided in Appendix B.

3. Simulation Study

3.1. Simulation Set-Ups

The comprehensive simulation study for this modeling approach is detailed in our previous work [24]. In this section, we concentrate on models that account for both temporal and spatial effects. We replicated the structure of the workers’ compensation claims data and performed extensive simulations across different scenarios. The data were generated using the following models:
  • Scenario 1 (S1): log { p i j k 1 p i j k } = x 1 , i j k + x 2 , i j k + ω i + γ i j ;
  • Scenario 2 (S2): log { p i j k 1 p i j k } = x 1 , i j k + x 2 , i j k + ω i + z i k T ξ , where ξ = ( 0 , 0.5 , 0.5 , 0.6 , 0.8 ) .
In this context, x 1 , i j k is a continuous variable generated from a standard normal distribution, N ( 0 , 1 ) ; x 2 , i j k is a binary variable that follows a Bernoulli distribution, B e r n o u l l i ( 0.5 ) . Both x 1 , i j k and x 2 , i j k vary across different claims, cohorts, and counties. The vector z i k = ( z i 1 k , , z i J k ) includes z i j k as the indicator for the j th temporal cohort. In Scenario S1, ω i j = ω i + γ i j is generated iteratively, accounting for the dependency between adjacent cohorts, with ω i j = 2 ω i , j 1 + ζ , where ζ is drawn from a standard normal distribution, N ( 0 , 1 ) .
To simulate environmental exposure and produce comparable data, we also treated x 1 , i j k as a county-level risk factor in each scenario to assess our methods. In other words, x 1 , i j k varies only between counties and not among individual claims, meaning that x 1 , i j k = x 1 , i .
Given that Florida has 67 counties, we mirrored this in our setup, with i = 1 , , 67 . To streamline computational complexity, we assumed five temporal cohorts per county ( J = 5 ) . The number of claims, K i j , for each county and temporal cohort was determined based on the proportion of cases in the actual workers’ compensation claims data, ensuring that each county had a minimum of five claims per day. To summarize the results, we generated 1000 Monte Carlo datasets for each scenario, and for each simulated dataset, we fitted the following models (M1 and M2) using a Bayesian framework:
  • M1: log { p i j k 1 p i j k } = β 1 x 1 , i j k + β 2 x 2 , i j k + ξ z i j k + γ i z i j k + ω i ;
  • M2: log { p i j k 1 p i j k } = β 1 x 1 , i j k + β 2 x 2 , i j k + z i k T ξ + η i T γ i + ω i .
It is important to note that η i and γ i are defined in Section 2.3, and the matrices D ω and C (the adjacency matrix) used in the MCAR and CAR priors were generated specifically for the counties in Florida. For parameter estimation, the posterior mean of each parameter was derived from 2000 MCMC samples, resulting in a total of 2500 posterior samples after discarding the first 500 samples as burn-in. While additional MCMC samples may be required depending on model convergence diagnostics, our numerical studies indicated that satisfactory results were achieved under these conditions (see results below). In summary, for the Monte Carlo replications, we report the bias (Bias), standard deviation (SD) of parameter estimates, mean squared error (MSE), and model selection probabilities based on the DIC.

3.2. Simulation Results

Here, we present the summary statistics for the estimates of the primary parameters β ^ 1 and β ^ 2 across various scenarios. Detailed results for all scenarios are available in Table 1. The bias values for β ^ 1 and β ^ 2 under M1 were the lowest among all candidate models; however, M1 displayed the poorest performance with the highest bias, especially in Scenario S1. In terms of mean squared error (MSE), M1 continued to show the best performance, yielding the smallest values across the different scenarios. Importantly, M2’s performance was similar to that of M1, demonstrating minimal bias and variability in Scenario S2.
We also assessed model fitting for scenarios where x 1 was the only risk factor that varied across counties; the results are presented in Table 2. Compared to the findings in Table 1, the bias of β ^ 1 in M1 was significantly higher, particularly in Scenario S2, when compared to M2, despite the mean squared errors (MSEs) being similar. Regarding model selection across different scenarios, the selection probabilities based on the DIC criterion are outlined in Table 3. While Model M1 was generally preferred, in Scenario S2—characterized by spatial and temporal heterogeneity with a consistent linear temporal trend within counties—Model M2 recorded the highest selection rate. Additional simulation results for a spatial model that does not account for the temporal cohort effect variation (M3) can be found in Appendix C.

4. Data Application

4.1. Data Source

The source of injury and illness outcome data used in this research are the state of Florida and New York workers’ compensation claims datasets provided by the state workers’ compensation boards of Florida (https://myfloridacfo.com/division/wc/ accessed on 1 July 2023) and New York (https://www.wcb.ny.gov/ accessed on 1 July 2023).These claims datasets represent all types of illnesses and injuries, and include work-related injury and illness claims registered and accepted by the respective state’s workers’ compensation board from 1 January 2000 to 31 December 2020. The information that can be distributed to the public by workers’ compensation boards is subject to state statutes, with an aim of keeping claims information unidentifiable to a specific person. Workers’ compensation claims data represent perhaps the only source of data on worker illness and injury for a large state population. For each claim record, information includes the date of injury or illness, the county where the injury or illness occurred, the nature of the injury or illness, the industrial classification, and other characteristics dependent on state statutes such as age and gender for New York claims.
The nature of injury codes for each claim was used to identify certain heat-related illnesses and non-heat-related illnesses. While Florida uses the National Council on Compensation Insurance (NCCI) coding scheme, New York has used the Workers’ Compensation Insurance Organization (WCIO) coding scheme since 2013, and formerly used the Occupational Injury and Illness Classification System (OIICS) coding system from 2010 to 2015.
In Florida, there were approximately 1.7 million claims with indemnity benefits processed from 1 January 2000 through 31 December 2020. In New York, there were approximately 4 million claims from events that occurred between 1 January 2000 through 31 December 2020, and these claims included those with indemnity benefits as well as medical only, non-compensated to date, and canceled claims. About 0.05% of the claims in the New York dataset and about 0.2% of the claims in the Florida dataset could be categorized as heat-related illness claims.
Daily meteorological data on maximum temperatures were obtained from the U.S. National Center for Environmental Information through the National Oceanic and Atmosphere Agency (National Climatic Data Center https://www.noaa.gov/education/resource-collections/data/climate accessed on 1 January 2023) for the baseline period, 1980 to 1999, and the study period, 2000 to 2019. The data included daily maximum temperature (Tmax) for each county and date combination. For each county, a calendar day-specific 95th percentile threshold was calculated from the baseline data (1980–1999) using a 31-day window centered on the particular calendar day of interest, i.e., day in which an illness occurred. For example, the 95th percentile Tmax threshold for May 15 was determined based on all Tmax values for that county spanning May 1–May 31 during the 1980–1999 period (20 years × 31 days/year = 620 daily Tmax values). The 95th percentile maximum temperature values were referred to as the Extreme Temperature Threshold 95th percentile (ETT95). This exposure metric captured the frequency of extreme events and is relevant in the context of climate change. Extreme temperatures may exist for only one day, but could also take place for multiple days as a heat wave, which may then have a different effect on illnesses and injuries. We also explored one-day lag exposure periods to determine if there is a lag that occurs between exposure and injury. Considering the computation time complexity of the Bayesian approach, we applied our proposed models to the New York and Florida workers’ compensation claims data for 2010–2020. We focus on the 3103 heat prostration illness claims reported in the 2010–2020 New York and Florida workers’ compensation claims data.

4.2. Results

We explored spatial–temporal models that account for both spatial and temporal heterogeneity for this data application. For each model, 1000 samples were drawn during the burn-in period, followed by an additional 2000 samples drawn from the posterior distribution.
The results for fixed-effect parameters are summarized in Table 4. Based on the DIC criterion, M2 was the best candidate model. Accordingly, a significantly higher number of claims was observed for those who were male, aged 65 and above, and had exposure to EHEs. A higher chance of hospitalization was observed for claims on EHE days. For instance, the average hospitalization of cases who were not exposed to EHEs was 0.01 = 1 exp ( 0.011 ) (95% CI 0.01–0.07) times less than that of those who were. In addition, the county-level risk factor, precipitation, was found to have a significant effect on mortality risk only in the case of M3. This finding agrees with our simulation studies, where M3 showed superior performance in parameter estimation compared with the other alternatives in the presence of county-level risk factors. In particular, the average heat prostration hospitalization of men who lived in the higher-humidity area was 1.0 % ( 1 exp ( 0.011 ) (95% CI 0.78–1.5%) more than that of men in lower-humidity areas. We observed similar results while using one-day lag exposure, as shown in Table 5.

5. Conclusions

This study examined various models to determine the optimal approach for analyzing spatiotemporal heterogeneity in workers’ compensation claims data, specifically for heat prostration hospitalizations in New York and Florida. The findings highlight that Model M2, which incorporates both spatial and temporal random effects with time treated as a continuous variable, displayed superior performance in most scenarios, as indicated by low bias and variability in parameter estimates. This model’s accuracy in capturing the complex interactions between risk factors and exposure timing suggests practical value for informing heat-related public health interventions. Additionally, when strong hierarchical structures of risk factors across counties and temporal variability were present, Model M1 proved advantageous, offering flexibility in interpreting localized and temporally specific risks. This study’s use of the deviance information criterion (DIC) further underscored the robustness of these models in selecting appropriate frameworks for accurate, interpretable results.
Previous research on spatial and temporal modeling has largely focused on aggregate data analyses, often at the county or administrative level, or has incorporated either spatial or temporal correlations without adequately capturing both. This study advances current methodologies by integrating spatiotemporal dependencies within a Bayesian hierarchical framework, thereby addressing a critical gap for localized, population-level analyses of environmental exposures in health data. Unlike traditional methods, which may miss nuanced exposure–response relationships in ecological health studies, this approach leverages random effects for both spatial and temporal variability, aligning with emerging trends favoring multilevel Bayesian models in environmental epidemiology. By using continuous-time treatment, this study enhances upon previous static models, providing a nuanced understanding of variable exposure impacts across regions and time frames.
The study has several limitations. First, it does not address potential lag effects of extreme heat events (EHEs), which may impact health outcomes over delayed intervals. Additionally, the hierarchical structure of the data—although considered in the modeling—may still present challenges due to the potential misclassification of risk factors across population clusters. Another limitation is the computational intensity of Bayesian approaches, particularly with high-dimensional data and complex parameter structures, which can introduce longer processing times and potential convergence issues. Furthermore, the study’s reliance on workers’ compensation data, while valuable for heat prostration insights, may not capture broader population-level trends in heat-related hospitalizations. Although the exposure metric reflects the frequency of extreme events and is relevant in the context of climate change, we acknowledge the potential for misclassification of outcomes in claims data.
Future research could explore models that incorporate lag effects, such as distributed lag models (DLMs) or spatially varying Gaussian process models for critical windows (SpGPCWs), to better capture the delayed impacts of EHEs on health outcomes. Additionally, the adoption of advanced algorithms like blocked Gibbs sampling or slice sampling within the Bayesian framework could help reduce computational demands, making these models more accessible for extensive data applications. Research expanding this approach to other health conditions or datasets, like the Surveillance, Epidemiology, and End Results (SEER) program or the National Program of Cancer Registries, could provide broader applications for spatiotemporal modeling in health outcomes associated with environmental exposures. These developments could significantly improve risk prediction and public health planning in response to environmental stressors.

Supplementary Materials

The following supporting information can be downloaded at https://www.mdpi.com/article/10.3390/stats7040080/s1: Table S1: Coding Schemes used by Florida and New York.

Author Contributions

Conceptualization, M.L. and M.W.; methodology, M.L. and M.W.; software, M.L., L.Z. and Z.L.; validation, M.L., L.Z. and Z.L.; formal analysis, M.L.; data curation, M.L.; writing—original draft preparation, M.L.; writing—review and editing, M.W.; supervision, M.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Informed consent was obtained from all subjects involved in the study.

Data Availability Statement

The raw data supporting the conclusions of this article will be made available by the authors on request.

Conflicts of Interest

Author Zheng Li was employed by the company Novartis Pharmaceuticals. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Appendix A. Bayesian Inference with the MCAR Prior

When D ω C is non-singular, the density function of Δ is well defined, allowing for a unique solution based on the likelihood approach. However, if D ω C is singular, the density function is not well defined, making Bayesian methods more suitable for parameter estimation and inference. Let D = { Y i j k , δ i j k } , X = { x i j k } , Z = { z i j k } ; i = 1 , , I ; j = 1 , , J ; k = 1 , , K i j . Then, we can derive the posterior distribution of Λ as follows:
Λ | D , β , Δ Wishart ( p + I , ( R + V 1 ) 1 ) ,
where the element in the i th row and j th column of V is V i j = Δ i * T ( D w C ) Δ j * , with Δ i * = ( Δ 1 i , , Δ I i ) T and Δ j * = ( Δ 1 j , , Δ I j ) T . The Gibbs sampler algorithm is used to generate the posterior samples of β , ξ , μ , and Λ [33]. In particular, for the t th iteration, t = 1 , , M (where M is the total number of samples we will draw from the posterior distribution), we have the following:
Step 1:
Sample β ( t ) from Pr ( β ( t ) | D , X , Z , ξ ( t 1 ) , Δ ( t 1 ) , Λ ( t 1 ) ) ;
Step 2:
Sample ξ ( t ) from Pr ( ξ ( t ) | D , X , Z , β ( t ) , Δ ( t 1 ) , Λ ( t 1 ) ) ;
Step 3:
Sample ω i ( t ) from Pr ( Δ i ( t ) | D , X , Z , β ( t ) , ω ( i ) ( t 1 ) , Λ t 1 ) , for i = 1 , , I ;
Step 4:
Sample Λ ( t ) from Pr ( Λ ( t ) | D , X , Z , β ( t ) , ξ ( t ) , Δ ( t ) ) .
Of note, there is no closed form for the full conditional posterior distribution except in step 4, which is a Wishart distribution, Wishart ( p + I , ( R + V 1 ) 1 ) . The Metropolis–Hastings algorithm was used to sample the parameters from their full conditional distribution [34]. Taking Pr ( β ( t ) | D , X , Z , ξ ( t 1 ) , Δ ( t 1 ) , Λ ( t 1 ) ) as an example, we have the following procedures:
  • Generate U U n i f ( 0 , 1 ) and W N ( 0 , 1 ) ;
  • Generate β N e w from β N e w = β ( t 1 ) + s W , where s is the step size of a random walk process;
  • Calculate
    L R = L ( D | β N e w , Δ ) Pr ( β N e w ) L ( D | β ( t 1 ) , Δ ) Pr ( β ( t 1 ) ) ;
  • If L R > U , β ( t ) = β N e w ; otherwise, β ( t ) = β ( t 1 ) .
In this context, the value of s is selected to maintain the rejection rate within acceptable limits, targeting an acceptance rate between 10% and 65%. In our simulations, the rejection rate is approximately 46%. We continuously sample β , ξ , μ and Λ from steps 1 to 4 until we acquire a sufficient number of posterior samples. After obtaining a total of M posterior samples, we estimate the parameters by calculating their posterior means and conducting further inference. To ensure the identifiability of the model parameters, Δ i j is centered so that $ i j Δ i j = 0 .

Appendix B. Computing Package

The computing package that includes the code for implementing spatial–temporal Bayesian models is detailed in this article. Developed in R with callable C functions to enhance computational efficiency, the package is publicly available on GitHub. For access, please visit the website at https://github.com/Maggie8888/BayesianST (accessed on 14 October 2024).

Appendix C. Additional Simulation Results

We conducted further checks on the performance of our proposal by considering the other scenarios and models. The data were generated using the following models:
  • Scenario 3 (S3): log { p i j k 1 p i j k } = x 1 , i j k + x 2 , i j k + ω i + 0.5 ( j 1 ) ;
  • Scenario 4 (S4): log { p i j k 1 p i j k } = x 1 , i j k + x 2 , i j k + ω i + 0.5 ( j 1 ) 0.5 .
And we fit the simulated data with additional M3, which is expressed as
  • M3: log { p i j k 1 p i j k } = β 1 x 1 , i j k + β 2 x 2 , i j k + ω i .
The results are summarized in Table A1:
Table A1. Summary of the estimation results for the scenarios. Par: parameters; Bias: the bias of Monte Carlo average of parameter estimates; SD: Monte Carlo standard deviation of parameter estimates; MSE: mean squared error.
Table A1. Summary of the estimation results for the scenarios. Par: parameters; Bias: the bias of Monte Carlo average of parameter estimates; SD: Monte Carlo standard deviation of parameter estimates; MSE: mean squared error.
S1 S2 S3 S4
Par Bias SD MSE Bias SD MSE Bias SD MSE Bias SD MSE
M1 β 1 = 1 −0.0100.0120.001−0.0040.0120.0000.0000.0110.0000.0000.0100.000
β 2 = 1 −0.0050.0110.001−0.0010.0220.0000.0000.0200.0000.0000.0150.000
M2 β 1 = 1 −0.0540.0920.012−0.0030.0120.000−0.0330.0140.001−0.0090.0110.000
β 2 = 1 −0.0230.0320.003−0.0020.0220.001−0.0210.0240.001−0.0040.0200.000
M3 β 1 = 1 −0.1250.1400.035−0.0130.0150.000−0.0320.0170.001−0.0160.0150.000
β 2 = 1 −0.0620.0820.010−0.0070.0270.001−0.0150.0310.001−0.0080.0260.001

References

  1. Lin, S.; Luo, M.; Walker, R.J.; Liu, X.; Hwang, S.A.; Chinery, R. Extreme High Temperatures and Hospital Admissions for Respiratory and Cardiovascular Diseases. Epidemiology 2009, 20, 738–746. [Google Scholar] [CrossRef] [PubMed]
  2. Linares, C.; Culqui, D.; Carmona, R.; Ortiz, C.; Díaz, J. Short-term association between environmental factors and hospital admissions due to dementia in Madrid. Environ. Res. 2017, 152, 214–220. [Google Scholar] [CrossRef] [PubMed]
  3. Richardson, S.; Best, N. Bayesian hierarchical models in ecological studies of health–environment effects. Environmetrics Off. J. Int. Environmetrics Soc. 2003, 14, 129–147. [Google Scholar] [CrossRef]
  4. Sheppard, L. Insights on bias and information in group-level studies. Biostatistics 2003, 4, 265–278. [Google Scholar] [CrossRef]
  5. Richmond-Bryant, J.; Long, T.C. Influence of exposure measurement errors on results from epidemiologic studies of different designs. J. Exposure Sci. Environ. Epidemiol. 2020, 30, 420–429. [Google Scholar] [CrossRef] [PubMed]
  6. Dominici, F.; Zeger, S.L.; Samet, J.M. A measurement error model for time-series studies of air pollution and mortality. Biostatistics 2000, 1, 157–175. [Google Scholar] [CrossRef]
  7. Jerrett, M.; Burnett, R.T.; Kanaroglou, P.; Eyles, J.; Finkelstein, N.; Giovis, C.; Brook, J.R. A gis–environmental justice analysis of particulate air pollution in hamilton, canada. Environ. Plan. A 2001, 33, 955–973. [Google Scholar] [CrossRef]
  8. Jerrett, M.; Arain, A.; Kanaroglou, P.; Beckerman, B.; Potoglou, D.; Sahsuvaroglu, T.; Morrison, J.; Giovis, C. A review and evaluation of intraurban air pol- lution exposure models. J. Exposure Sci. Environ. Epidemiol. 2005, 15, 185–204. [Google Scholar] [CrossRef]
  9. Steinle, S.; Reis, S.; Sabel, C.E. Quantifying human exposure to air pol- lution—moving from static monitoring to spatio-temporally resolved personal exposure assessment. Sci. Total Environ. 2013, 443, 184–193. [Google Scholar] [CrossRef]
  10. Goldman, G.T.; Mulholland, J.A.; Russell, A.G.; Srivastava, A.; Strickl, M.J.; Klein, M.; Waller, L.A.; Tolbert, P.E.; Edgerton, E.S. Ambient air pollutant measurement error: Characterization and impacts in a time-series epidemiologic study in atlanta. Environ. Sci. Technol. 2010, 44, 7692–7698. [Google Scholar] [CrossRef]
  11. Turner, M.G.; Cardille, J.A. Spatial heterogeneity and ecosystem processes. In Key Topics in Landscape Ecology; Wu, J., Hobbs, R.J., Eds.; Cambridge Studies in Landscape Ecology; Cambridge University Press: Cambridge, UK, 2007; pp. 62–77. [Google Scholar]
  12. Chang, H.H.; Warren, J.L.; Darrow, L.A.; Reich, B.J.; Waller, L.A. Assessment of critical exposure and outcome windows in time-to-event analysis with application to air pollution and preterm birth study. Biostatistics 2015, 16, 509–521. [Google Scholar] [CrossRef] [PubMed]
  13. Warren, J.; Fuentes, M.; Herring, A.; Langlois, P. Spatial-temporal modeling of the association between air pollution exposure and preterm birth: Identifying critical windows of exposure. Biometrics 2012, 68, 1157–1167. [Google Scholar] [CrossRef] [PubMed]
  14. Warren, J.; Fuentes, M.; Herring, A.; Langlois, P. Bayesian spatial-temporal model for cardiac con- genital anomalies and ambient air pollution risk assessment. Environmetrics 2012, 23, 673–684. [Google Scholar] [CrossRef]
  15. Banerjee, S.; Carlin, B.P.; Gelfand, A.E. Hierarchical Modeling and Analysis for Spatial Data; CRC Press: Boca Raton, FL, USA, 2014. [Google Scholar]
  16. Heaton, M.J.; Datta, A.; Finley, A.O.; Furrer, R.; Guinness, J.; Guhaniyogi, R.; Gerber, F.; Gramacy, R.B.; Hammerling, D.; Katzfuss, M.; et al. A case study competition among methods for analyzing large spatial data. J. Agric. Biol. Environ. Statist 2019, 24, 398–425. [Google Scholar] [CrossRef]
  17. Zhang, Y.; Ebelt, S.T.; Shi, L.; Scovronick, N.C.; D’Souza, R.R.; Steenland, K.; Chang, H.H. Short-term associations between warm-season ambient temperature and emergency department visits for Alzheimer’s disease and related dementia in five US states. Environ. Res. 2023, 220, 115176. [Google Scholar] [CrossRef]
  18. Fritze, T. The Effect of Heat and Cold Waves on the Mortality of Persons with Dementia in Germany. Sustainability 2020, 12, 3664. [Google Scholar] [CrossRef]
  19. Armstrong, B.G. Fixed Factors that Modify the Effects of Time-Varying Factors: Applying the Case-Only Approach. Epidemiology 2003, 14, 467–472. [Google Scholar] [CrossRef] [PubMed]
  20. Schwartz, J. Who is Sensitive to Extremes of Temperature?: A Case-Only Analysis. Epidemiology 2005, 16, 67–72. [Google Scholar] [CrossRef]
  21. Zanobetti, A.; O’Neill, M.S.; Gronlund, C.J.; Schwartz, J.D. Susceptibility to Mortality in Weather Extremes: Effect Modification by Personal and Small-Area Characteristics. Epidemiology 2013, 24, 809–819. [Google Scholar] [CrossRef]
  22. Xu, Z.; Crooks, J.L.; Black, D.; Hu, W.; Tong, S. Heatwave and infants’ hospital admissions under different heatwave definitions. Environ. Pollut. 2017, 229, 525–530. [Google Scholar] [CrossRef]
  23. Madrigano, J.; Ito, K.; Johnson, S.; Kinney, P.L.; Matte, T. A Case-Only Study of Vulnerability to Heat Wave–RelatedMortality in New York City (2000–2011). Environ. Health Perspect. 2015, 123, 672–678. [Google Scholar] [CrossRef] [PubMed]
  24. Wang, M.; Li, Z.; Lu, J.; Zhang, L.; Li, Y.; Zhang, L. Spatial-temporal Bayesian accelerated failure time models for survival endpoints with applications to prostate cancer registry data. BMC Med. Res. Methodol. 2024, 24, 86. [Google Scholar] [CrossRef] [PubMed]
  25. Carracedo-Martínez, E.; Taracido, M.; Tobias, A.; Saez, M.; Figueiras, A. Case-Crossover Analysis of Air Pollution Health Effects: A Systematic Review of Methodology and Application. Environ. Health Perspect. 2010, 118, 1173–1182. [Google Scholar] [CrossRef]
  26. Szyszkowicz, M. Case-Crossover Method with a Short Time-Window. Int. J. Environ. Res. Public Health 2020, 17, 202. [Google Scholar] [CrossRef] [PubMed]
  27. Braeye, T.; Hens, N. Optimising the case-crossover design for use in shared exposure settings. Epidemiol Infect 2020, 148, e151. [Google Scholar] [CrossRef]
  28. Janes, H.; Sheppard, L.; Lumley, T. Case-Crossover Analyses of Air Pollution Exposure Data:Referent Selection Strategies and Their Implications for Bias. Epidemiology 2005, 16, 717–726. [Google Scholar] [CrossRef]
  29. Besag, J. Spatial interaction and the statistical analysis of lattice systems. J. R. Stat. Soc. Ser. B (Methodol.) 1974, 36, 192–236. [Google Scholar] [CrossRef]
  30. Hanson, T.E.; Jara, A.; Zhao, L. A Bayesian semiparametric temporally-stratified proportional hazards model with spatial frailties. Bayesian Anal. 2011, 6, 1–48. [Google Scholar] [CrossRef]
  31. Cai, B.; Lawson, A.B.; Hossain, M.; Choi, J.; Kirby, R.S.; Liu, J. Bayesian semiparametric model with spatially-temporally varying coefficients selection. Stat. Med. 2013, 32, 3670–3685. [Google Scholar] [CrossRef]
  32. Spiegelhalter, D.J.; Best, N.G.; Carlin, B.P.; Van Der Linde, A. Bayesian measures of model complexity and fit. J. R. Stat. Soc. Ser. B (Stat. Methodol.) 2002, 64, 583–639. [Google Scholar] [CrossRef]
  33. Casella, G.; George, E.I. Explaining the Gibbs sampler. Am. Stat. 1992, 46, 167–174. [Google Scholar] [CrossRef]
  34. Chib, S.; Greenberg, E. Understanding the Metropolis-Hastings algorithm. Am. Stat. 1995, 49, 327–335. [Google Scholar] [CrossRef]
Figure 1. Case-crossover (CCO) design in our study.
Figure 1. Case-crossover (CCO) design in our study.
Stats 07 00080 g001
Table 1. Summary of the estimation results for the scenarios. Par: parameters; Bias: the bias of Monte Carlo average of parameter estimates; SD: Monte Carlo standard deviation of parameter estimates; MSE: mean squared error.
Table 1. Summary of the estimation results for the scenarios. Par: parameters; Bias: the bias of Monte Carlo average of parameter estimates; SD: Monte Carlo standard deviation of parameter estimates; MSE: mean squared error.
S1 S2
Par Bias SD MSE Bias SD MSE
M1 β 1 = 1 −0.0100.0120.001−0.0040.0120.000
β 2 = 1 −0.0050.0110.001−0.0010.0220.000
M2 β 1 = 1 −0.0540.0920.012−0.0030.0120.000
β 2 = 1 −0.0230.0320.003−0.0020.0220.001
Table 2. Summary of the estimation results for the scenarios with the county-level risk factor x 1 , i . Par: parameters; Bias: the bias of Monte Carlo average of parameter estimates; SD: Monte Carlo standard deviation of parameter estimates; MSE: mean squared error.
Table 2. Summary of the estimation results for the scenarios with the county-level risk factor x 1 , i . Par: parameters; Bias: the bias of Monte Carlo average of parameter estimates; SD: Monte Carlo standard deviation of parameter estimates; MSE: mean squared error.
S1 S2
Par Bias SD MSE Bias SD MSE
M1 β 1 = 1 −0.0840.1310.024−0.0200.0800.007
β 2 = 1 −0.0060.0300.001−0.0020.0230.001
M2 β 1 = 1 −0.0920.1480.030−0.0090.1010.010
β 2 = 1 −0.0450.0630.006−0.0020.0230.001
Table 3. Summary results of the selection probabilities for different models under different scenarios. x 1 , i j k is a subject-level risk factor varied across counties and temporal years; x 1 , i is a county-level risk factor.
Table 3. Summary results of the selection probabilities for different models under different scenarios. x 1 , i j k is a subject-level risk factor varied across counties and temporal years; x 1 , i is a county-level risk factor.
x 1 , ijk x 1 , i
M1 M2 M1 M2
S10.930.070.910.09
S20.100.900.220.78
Table 4. Result summary for the data application of the heat prostration hospitalization using workers’ compensation claims data under M1 and M2 models. EST: parameter estimate; CL: credible limit.
Table 4. Result summary for the data application of the heat prostration hospitalization using workers’ compensation claims data under M1 and M2 models. EST: parameter estimate; CL: credible limit.
CovariateNew YorkFlorida
M1 M2 M1 M2
EST (95%CL) EST (95%CL) EST (95%CL) EST (95%CL)
Age
> = 65REFREFREFREF
<650.039 (0.035, 0.041)0.040 (0.036, 0.043)
Gender
MaleREFREFREFREF
Female0.057 (0.009, 0.097)−0.092 (−0.155, −0.013)
EHE
YesREFREFREFREF
No−0.011 (−0.015, −0.007)−0.022 (−0.029, −0.012)−0.012 (−0.017, −0.004)−0.018 (−0.023, −0.011)
Year(Cont)0.044 (0.039, 0.052)-0.075 (0.068, 0.086)-
Year = 2010-REF-REF
Year = 2011-−0.441 (−1.663, −0.116)-−0.036 (−0.107, 0.078)
Year = 2012-0.134 (−0.067, 0.143)-−0.114 (−0.157, 0.022)
Year = 2013-0.192 (−0.097, 0.331)-−0.112 (−0.165, 0.037)
Year = 2014-−0.429 (−0.877, −0.121)-−0.103 (−0.165, 0.050)
Year = 2015-0.095 (−0.570, 0.781)--0.053 (−0.159, 0.110)
Year = 2016-0.416 (−0.026, 0.741)-0.052 (−0.030, 0.191)
Year = 2017-1.526 (0.761, 1.891)-0.277 (0.193, 0.407)
Year = 2018-1.495 (0.754, 1.901)-0.267 (0.193, 0.403)
Year = 2019-2.214 (1.514, 2.996)-0.619 (0.536, 0.765)
Year = 2020-5.375 (2.499, 7.816)-1.271 (1.018, 1.553)
Model Diagnosis
DIC61,202.3169,965.2034,235.6429,876.01
Table 5. Result summary for the data application of the heat prostration hospitalization using workers’ compensation claims data under M1 and M2 models for lag one EHEs. EST: parameter estimate; CL: credible limit.
Table 5. Result summary for the data application of the heat prostration hospitalization using workers’ compensation claims data under M1 and M2 models for lag one EHEs. EST: parameter estimate; CL: credible limit.
CovariateNew YorkFlorida
M1 M2 M1 M2
EST (95%CL) EST (95%CL) EST (95%CL) EST (95%CL)
Age
> = 65REFREFREFREF
<650.038 (0.034, 0.040)0.041 (0.035, 0.042)
Gender
MaleREFREFREFREF
Female0.056 (0.007, 0.098)−0.090 (−0.154, −0.015)
EHE
YesREFREFREFREF
No−0.013 (−0.016, −0.008)−0.023 (−0.029, −0.013)−0.014 (−0.013, −0.004)−0.018 (−0.023, −0.011)
Year(Cont)0.046 (0.037, 0.052)-0.077 (0.066, 0.087)-
Year = 2010-REF-REF
Year = 2011-−0.443 (−1.661, −0.116)-−0.036 (−0.107, 0.078)
Year = 2012-0.135 (−0.063, 0.144)-−0.114 (−0.157, 0.022)
Year = 2013-0.193 (−0.096, 0.331)-−0.113 (−0.165, 0.037)
Year = 2014-−0.431 (−0.876, −0.121)-−0.104 (−0.165, 0.050)
Year = 2015-0.096 (−0.570, 0.781)-−0.051 (−0.154, 0.110)
Year = 2016-0.416 (−0.026, 0.741)-0.052 (−0.030, 0.191)
Year = 2017-1.530 (0.777, 1.891)-0.277 (0.193, 0.407)
Year = 2018-1.496 (0.754, 1.901)-0.261 (0.193, 0.403)
Year = 2019-2.213 (1.514, 2.996)-0.620 (0.536, 0.765)
Year = 2020-5.374 (2.499, 7.816)-1.272 (1.018, 1.553)
Model Diagnosis
DIC61,221.3169,966.2034,235.3329,876.00
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Liang, M.; Li, Z.; Zhang, L.; Wang, M. A Spatial–Temporal Bayesian Model for a Case-Crossover Design with Application to Extreme Heat and Claims Data. Stats 2024, 7, 1379-1391. https://doi.org/10.3390/stats7040080

AMA Style

Liang M, Li Z, Zhang L, Wang M. A Spatial–Temporal Bayesian Model for a Case-Crossover Design with Application to Extreme Heat and Claims Data. Stats. 2024; 7(4):1379-1391. https://doi.org/10.3390/stats7040080

Chicago/Turabian Style

Liang, Menglu, Zheng Li, Lijun Zhang, and Ming Wang. 2024. "A Spatial–Temporal Bayesian Model for a Case-Crossover Design with Application to Extreme Heat and Claims Data" Stats 7, no. 4: 1379-1391. https://doi.org/10.3390/stats7040080

APA Style

Liang, M., Li, Z., Zhang, L., & Wang, M. (2024). A Spatial–Temporal Bayesian Model for a Case-Crossover Design with Application to Extreme Heat and Claims Data. Stats, 7(4), 1379-1391. https://doi.org/10.3390/stats7040080

Article Metrics

Back to TopTop