1. Introduction
Data can be reported at different levels for various reasons, including cost, confidentiality, and the difficulty of data collection. For example, the US Census Bureau reports income at the household level. The aggregate-level data in this example are household incomes, where household income measures the combined incomes of all people sharing a particular household or place of residence; the individual-level data are individuals’ incomes. Aggregate-level data are defined as data aggregated from individual-level data by groups. There are risks in estimating individual-level relationships from aggregate-level data; for example, the correlation between two variables at the aggregate level can differ from the correlation between the same variables at the individual level [1,2]. Nevertheless, researchers continue to use aggregate-level data because, in many situations, individual-level data are not available and valid methods for estimating individual-level relationships from aggregate-level data can be derived [1,3]. The terms “individual” and “aggregate” refer to the different levels and units of analysis [1].
This article addresses the problem of estimating models that describe an individual-level relationship based on an aggregate-level response variable Y and individual-level predictors X. Examples of such data situations include survey data, multivariate time series, social data, and biological data collected and reported at different levels.
Our interest in developing methods to analyze aggregate data was motivated by real-life examples. One example is group testing of infectious diseases in biostatistics. To reduce costs, a two-stage sequential testing strategy is applied: in the first stage, group testing is conducted, and individuals who test positive in the first stage are called back for a follow-up individual test. Analyses can already be conducted with the first-stage group testing data. The second example is consumer demand studies in economics, where consumers’ characteristics are available at the individual level, whereas their purchase data are available only at the aggregate level. The third example is the analysis of multivariate time series, where different series may be reported at different frequencies; studying the relationships between series with different frequencies requires appropriate statistical methods.
Suppose there are n observations in the sample, $(\mathbf{x}_i, y_i)$, $i = 1, \ldots, n$, aggregated into K groups, $g = 1, \ldots, K$, with group sizes, respectively, of $n_1, \ldots, n_K$, where $\sum_{g=1}^{K} n_g = n$. Denote the set of observations in Group g as $I_g$. Aggregate-level X and Y, i.e., $\mathbf{x}_g^a$ and $y_g^a$, are
$$ \mathbf{x}_g^a = \sum_{i \in I_g} \mathbf{x}_i, \qquad y_g^a = \sum_{i \in I_g} y_i, \qquad g = 1, \ldots, K. \tag{1} $$
Note that $y_g^a$ can be any summary statistic calculated from individual-level Y in Group g; we study summation aggregation in this paper.
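For concreteness, the following minimal R sketch illustrates this aggregation scheme on hypothetical simulated data (the variable names and settings are ours, not the paper's): individual-level predictors are kept, while the binary responses are summed within each group.

```r
set.seed(1)
n <- 200; K <- 40                      # 200 individuals aggregated into 40 groups
group <- rep(1:K, each = n / K)        # equal group sizes n_g = 5
x <- rnorm(n)                          # individual-level predictor
beta <- c(-0.5, 1)                     # true intercept and slope
p <- plogis(beta[1] + beta[2] * x)     # individual success probabilities
y <- rbinom(n, size = 1, prob = p)     # individual-level binary responses

# Aggregate-level response: sum of Y within each group (Equation (1))
y_agg <- tapply(y, group, sum)
n_g   <- tapply(y, group, length)
```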
Researchers have solved this problem for linear models [4,5,6]. Suppose the linear model describing individual-level data $(\mathbf{x}_i, y_i)$ is
$$ y_i = \mathbf{x}_i^{\top} \boldsymbol{\beta} + \epsilon_i, \qquad i = 1, \ldots, n. $$
Then, the corresponding model describing the aggregated data $(\mathbf{x}_g^a, y_g^a)$ is
$$ y_g^a = (\mathbf{x}_g^a)^{\top} \boldsymbol{\beta} + \epsilon_g^a, \qquad g = 1, \ldots, K, $$
where $\epsilon_g^a = \sum_{i \in I_g} \epsilon_i$ is the aggregate-level error, so that weighted least squares (WLS) can be applied when the individual-level errors are independent with common variance $\sigma^2$, since then $\operatorname{Var}(\epsilon_g^a) = n_g \sigma^2$ [4]. More estimators have been proposed for linear regression based on aggregate data or partially aggregate data, including Palm and Nijman’s MLE estimator [5] and Rawashdeh and Obeidat’s Bayesian estimator [6].
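As an illustration, the following R sketch (with hypothetical simulated data and variable names of our choosing) fits the aggregated linear model by WLS with weights proportional to $1/n_g$, reflecting $\operatorname{Var}(\epsilon_g^a) = n_g \sigma^2$.

```r
set.seed(2)
K <- 40
sizes <- rep(c(3, 5, 7), length.out = K)       # unequal group sizes n_g
group <- rep(1:K, times = sizes)
n <- length(group)
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n)                      # individual-level linear model

# Aggregate both sides of the model within groups
y_agg <- tapply(y, group, sum)
x_sum <- tapply(x, group, sum)
n_g   <- tapply(x, group, length)

# Aggregated model: y_agg = beta0 * n_g + beta1 * x_sum + aggregate error,
# Var(aggregate error) = n_g * sigma^2, hence weights 1 / n_g
fit_wls <- lm(y_agg ~ 0 + n_g + x_sum, weights = 1 / n_g)
coef(fit_wls)                                  # estimates of beta0 and beta1 (true values 1 and 2)
```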
Although the estimation of linear regression models in the above data situation has been well studied, the estimation of other regression models requires further study. The aim of this article is to study the estimation of logistic models in the data situation of aggregate-level Y and individual-level X. We derive the likelihood and our estimators with different optimization methods in Section 2, conduct simulation studies to evaluate and compare the performances of different estimators in Section 3, illustrate the use of different estimators in real data-based studies in Section 4, provide discussions in Section 5, and draw conclusions in Section 6.
2. Methods
Suppose n independent observations $(\mathbf{x}_i, y_i)$, $i = 1, \ldots, n$, are modeled by a logistic model
$$ \operatorname{logit}(p_i) = \log \frac{p_i}{1 - p_i} = \mathbf{x}_i^{\top} \boldsymbol{\beta}. \tag{2} $$
Then, $y_i \sim \mathrm{Bernoulli}(p_i)$, where
$$ p_i = \frac{\exp(\mathbf{x}_i^{\top} \boldsymbol{\beta})}{1 + \exp(\mathbf{x}_i^{\top} \boldsymbol{\beta})}. $$
When individual-level X and Y are both available, the logistic model, as a generalized linear model, can be estimated using a range of methods, including the Newton–Raphson method and Fisher’s scoring method [7,8].
2.1. Likelihood of Aggregate-Level Y and Individual-Level X
When individual-level Y is not available, we can derive estimators based on aggregate-level Y and individual-level X. Suppose the n observations of $(\mathbf{x}_i, y_i)$ are aggregated into K groups, as described in the introduction section, with the aggregated data $(\mathbf{x}_g^a, y_g^a)$ defined in Equation (1).
Aggregate-level Y is obtained by summing all Y within each group. Thus, the distribution of the sum of multiple independent random variables is helpful for studying data aggregation. In our logistic regression scenario, we need the distribution of the sum of multiple Bernoulli random variables. In statistics, the Poisson binomial distribution is the distribution of a sum of independent Bernoulli random variables that do not necessarily have the same success probability [9,10]. The term $\mathrm{PB}(p_1, \ldots, p_n)$ is used to refer to the distribution of the sum of n independent Bernoulli random variables with success probabilities $p_1, \ldots, p_n$ [9].
Because $y_g^a$ is the sum of $n_g$ independent Bernoulli random variables,
$$ y_g^a \sim \mathrm{PB}(p_{g1}, \ldots, p_{g n_g}), \tag{3} $$
where the success probability for the ith individual in Group g is
$$ p_{gi} = \frac{\exp(\mathbf{x}_{gi}^{\top} \boldsymbol{\beta})}{1 + \exp(\mathbf{x}_{gi}^{\top} \boldsymbol{\beta})}. \tag{4} $$
Denote the likelihood contribution of Group g as $L_g(\boldsymbol{\beta}) = P(Y_g^a = y_g^a)$. Then, the aggregate likelihood is $L(\boldsymbol{\beta}) = \prod_{g=1}^{K} L_g(\boldsymbol{\beta})$.
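To make the construction concrete, the following minimal R sketch evaluates the aggregate log-likelihood for a candidate $\boldsymbol{\beta}$ on hypothetical simulated data. It assumes the dpoibin() function from the poibin package (our reading of Hong’s implementation [10]); the variable names are ours.

```r
library(poibin)

set.seed(3)
n <- 200; K <- 40
group <- rep(1:K, each = n / K)
X <- cbind(1, rnorm(n))                     # design matrix with intercept
beta_true <- c(-0.5, 1)
p_true <- plogis(as.vector(X %*% beta_true))
y <- rbinom(n, 1, p_true)
y_agg <- tapply(y, group, sum)              # aggregate-level responses

# Aggregate log-likelihood: sum over groups of log P(Y_g^a = y_g^a),
# where Y_g^a follows a Poisson binomial distribution (Equations (3) and (4))
loglik_agg <- function(beta) {
  p <- plogis(as.vector(X %*% beta))        # individual success probabilities
  sum(sapply(1:K, function(g) {
    idx <- which(group == g)
    log(dpoibin(kk = y_agg[g], pp = p[idx]))
  }))
}

loglik_agg(beta_true)
```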
2.2. Calculation and Maximization of Likelihood
Computing the likelihood function requires calculating the probability mass function of the Poisson binomial distribution in Equation (3). The variable $y_g^a$ reduces to a binomial random variable when the success probabilities $p_{gi}$ are equal within Group g. This case can happen when aggregation is based on the values of X and the individual-level predictors $\mathbf{x}_{gi}$ are the same within each group. This specific aggregation has been well studied in the topic of logistic regression based on aggregate data [7,11]. We consider aggregation not based on X, i.e., allowing different values of X in a group, in this paper.
In general, for a variable $Y \sim \mathrm{PB}(p_1, \ldots, p_n)$, the probability mass function is
$$ P(Y = y) = \sum_{A \in F_y} \prod_{i \in A} p_i \prod_{j \in A^c} (1 - p_j), $$
where $F_y$ is the set of all subsets of y integers that can be selected from $\{1, 2, \ldots, n\}$ and $A^c$ is the complement of A [9]. The set $F_y$ contains $\binom{n}{y}$ elements, so the sum over it is computationally intensive and even infeasible for large n. Instead, more efficient methods have been proposed, including the use of a recursive formula that calculates the probability mass function of $\mathrm{PB}(p_1, \ldots, p_n)$ from that of $\mathrm{PB}(p_1, \ldots, p_{n-1})$, which is numerically unstable for large n [12], and the inverse Fourier transform method [13]. Hong [10] further developed this line of work by proposing an algorithm that efficiently implements an exact, closed-form expression for the Poisson binomial distribution. We adopted Hong’s algorithm [10] and exact formula in calculating the likelihood function $L(\boldsymbol{\beta})$ since they are more precise and numerically stable.
Commonly used optimization methods were adopted to maximize the likelihood $L(\boldsymbol{\beta})$, including (1) Nelder and Mead’s simplex method (NM) [14], (2) the Broyden–Fletcher–Goldfarb–Shanno (BFGS) method [15], and (3) the conjugate gradient (CG) method [16].
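In practice, the maximization can be carried out with a standard general-purpose optimizer. The sketch below continues the hypothetical loglik_agg() function from the sketch in Section 2.1 and minimizes the negative log-likelihood with each of the three methods via R's optim(); it is an illustration of the approach, not the paper's exact code.

```r
# Negative aggregate log-likelihood, using loglik_agg() from the earlier sketch
negloglik <- function(beta) -loglik_agg(beta)

start <- c(0, 0)                         # starting values for (beta0, beta1)
fit_nm   <- optim(start, negloglik, method = "Nelder-Mead")
fit_bfgs <- optim(start, negloglik, method = "BFGS")
fit_cg   <- optim(start, negloglik, method = "CG")

rbind(NM = fit_nm$par, BFGS = fit_bfgs$par, CG = fit_cg$par)
```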
2.3. Large-Sample Properties of Estimators
As mentioned above, our proposed estimators are obtained by maximizing the aggregate likelihood using different optimization methods (NM, BFGS, and CG). The MLE is the estimator that maximizes the aggregate likelihood function, i.e., $\hat{\boldsymbol{\beta}}_{\mathrm{MLE}} = \arg\max_{\boldsymbol{\beta}} L(\boldsymbol{\beta})$. If our three optimization methods can always obtain the maximizer of $L(\boldsymbol{\beta})$, the three estimators will be equal and exactly the same as the MLE.
In practice, the three optimization methods may not obtain the same value as the MLE. We observed that as the sample size increases, the values obtained using the three optimizations become closer and nearly the same for a large sample size. In discussing large-sample properties, we refer to the scenario of an infinite number of observations and assume that, in this scenario, the three optimization methods always obtain the MLE. Then, our three estimators are identical to the MLE and have the same large-sample properties as the MLE. We add a cautionary note that if our estimators were still quite different from the MLE in the large-sample scenario, we could not state that our estimators have the same large-sample properties as the MLE.
The large-sample properties of the MLE $\hat{\boldsymbol{\beta}}$ [17] include (i) consistency, i.e., $\hat{\boldsymbol{\beta}} \to \boldsymbol{\beta}$ in probability, and (ii) asymptotic normality, i.e., $\hat{\boldsymbol{\beta}}$ is asymptotically distributed as $N\!\left(\boldsymbol{\beta}, I(\boldsymbol{\beta})^{-1}\right)$, where $I(\boldsymbol{\beta})$ is the expected information matrix, defined as the negative expectation of the second derivative of the log-likelihood. The expected information matrix can be approximated using the observed information matrix, which is the negative of the second derivative (the Hessian matrix) of the log-likelihood function [17].
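A practical consequence is that approximate standard errors can be obtained from the numerically differentiated Hessian of the negative log-likelihood at the optimum. The sketch below continues the hypothetical example from the earlier sketches and is a minimal illustration, not the paper's exact procedure.

```r
# Observed information = Hessian of the negative log-likelihood at the maximizer
fit <- optim(c(0, 0), negloglik, method = "BFGS", hessian = TRUE)

vcov_hat <- solve(fit$hessian)           # approximate variance-covariance matrix
se_hat   <- sqrt(diag(vcov_hat))         # approximate standard errors
cbind(estimate = fit$par, std.error = se_hat)
```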
2.4. Software Implementation
All analyses in this paper were conducted using R software (version 4.2.0). Multiple R packages were used as follows:
The poibin package. This package implements multiple exact and approximate methods to calculate Poisson binomial distributions [10]. We used this package to calculate the Poisson binomial distributions and the aggregate likelihood $L(\boldsymbol{\beta})$.
The stats package. This package contains the optim function, which can conduct general-purpose optimization based on multiple optimization methods, including the Nelder–Mead, BFGS, and CG methods. We used this function to obtain our three estimators using the three optimization methods.
The glm function in the stats package. This function can be used to fit generalized linear models, including logistic regression. We used it to conduct logistic regression.
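For reference, the individual-level logistic regression used elsewhere in the paper (estimator E1 in Sections 3 and 4) corresponds to a standard glm() call; the sketch below uses the hypothetical simulated objects y and X from the sketch in Section 2.1.

```r
# Individual-level logistic regression (feasible only when individual-level Y is available)
fit_individual <- glm(y ~ X[, 2], family = binomial(link = "logit"))
summary(fit_individual)$coefficients
```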
2.5. Computational Burden
The computational burden of our method depends on three factors: (1) the number of parameters p, (2) the aggregate-level sample size K, and (3) the group sizes $n_g$.
Our estimator for $\boldsymbol{\beta}$ is obtained by maximizing the aggregate likelihood $L(\boldsymbol{\beta})$ using three optimization methods (NM, BFGS, and CG). The number of evaluations of the objective function and its derivatives increases as p increases, so a large p will decrease the performance. Given a small fixed p, the number of function evaluations is bounded. Because $L(\boldsymbol{\beta}) = \prod_{g=1}^{K} L_g(\boldsymbol{\beta})$, the computational amount for $L(\boldsymbol{\beta})$ is K times the computational amount for a single $L_g(\boldsymbol{\beta})$.
The computation of $L_g(\boldsymbol{\beta})$ includes two steps. In Step 1, the success probabilities $p_{gi}$ are calculated using Equation (4); the computational burden of Step 1 is $O(n_g)$. In Step 2, the probability mass for the Poisson binomial random variable described in Equation (3) is calculated. This step adopts Hong’s Algorithm A, which is an efficient implementation of the discrete Fourier transform of the characteristic function (DFT-CF) of the Poisson binomial distribution [10]; the computational burden of Step 2 is $O(n_g^2)$. In total, the computational burden of our estimation method is $O\!\left(\sum_{g=1}^{K} n_g^2\right)$, given a small constant p.
3. Simulation Studies
We conducted simulation studies to evaluate the performance of five estimators. The first estimator, named individual-LR, is the logistic regression estimator based on individual-level X and Y. This estimator is infeasible when only aggregate-level Y is available. Because aggregate-level Y contains less information than individual-level Y, we expect this infeasible estimator to provide an upper bound for the performance of the feasible estimators based on aggregate-level Y. The second estimator, named naive LR, is the logistic regression estimator based on the mean X in each group and the aggregate Y, i.e., the pairs $(\bar{\mathbf{x}}_g, y_g^a)$, $g = 1, \ldots, K$. This estimator provides a rough approximation.
Estimators 3 to 5 are our estimators that maximize the aggregate likelihood using Nelder–Mead, CG, and BFGS optimization, named aggregate LR with NM, aggregate LR with CG, and aggregate LR with BFGS, respectively.
The performances of the estimators were compared in three scenarios, each with two sub-scenarios (A and B) corresponding to different parameter values. In each scenario, simulations were conducted with several sample sizes K and equal group sizes $n_g$, and data were generated from the individual-level logistic model in Equation (2) under the predictor distributions and coefficient values specified for Scenarios 1A/1B, 2A/2B, and 3A/3B.
The bias, variance, mean square error (MSE), and mean absolute deviation (MAD) of each of the five estimators’ (E1 to E5) estimates of the model parameters $\beta_j$ were calculated. Denote the bias, variance, MSE, and MAD of the q-th estimator of $\beta_j$ as $\mathrm{Bias}_{qj}$, $\mathrm{Var}_{qj}$, $\mathrm{MSE}_{qj}$, and $\mathrm{MAD}_{qj}$. The average squared bias, variance, MSE, and MAD of the q-th estimator were calculated as
$$ \overline{\mathrm{Bias}^2_q} = \frac{1}{J}\sum_{j} \mathrm{Bias}_{qj}^2, \quad \overline{\mathrm{Var}}_q = \frac{1}{J}\sum_{j} \mathrm{Var}_{qj}, \quad \overline{\mathrm{MSE}}_q = \frac{1}{J}\sum_{j} \mathrm{MSE}_{qj}, \quad \overline{\mathrm{MAD}}_q = \frac{1}{J}\sum_{j} \mathrm{MAD}_{qj}, $$
where J is the number of model parameters. Please note that we averaged the squared bias instead of the bias because positive and negative biases can cancel out when averaging the bias. Averaging across the parameters allows us to obtain the average performance in terms of the squared bias, variance, MSE, and MAD and still maintain the usual relationship between the bias, variance, and MSE, i.e.,
$$ \overline{\mathrm{MSE}}_q = \overline{\mathrm{Bias}^2_q} + \overline{\mathrm{Var}}_q. $$
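As an illustration of how these summaries can be computed, the following minimal R sketch (with hypothetical replication results, not the paper's actual simulation output) calculates the per-parameter bias, variance, MSE, and MAD and averages them across parameters; the population form of the variance is used so that the MSE decomposition holds exactly.

```r
# est: R x J matrix of estimates (R simulation replications, J parameters)
# beta_true: length-J vector of true parameter values
metrics <- function(est, beta_true) {
  bias <- colMeans(est) - beta_true
  vars <- colMeans(sweep(est, 2, colMeans(est))^2)   # population variance so MSE = bias^2 + var
  mse  <- colMeans(sweep(est, 2, beta_true)^2)
  mad  <- colMeans(abs(sweep(est, 2, beta_true)))
  c(avg_sq_bias = mean(bias^2), avg_var = mean(vars),
    avg_mse = mean(mse), avg_mad = mean(mad))
}

# Toy example with 500 replications and J = 2 parameters
set.seed(4)
est <- cbind(rnorm(500, mean = -0.45, sd = 0.2), rnorm(500, mean = 1.05, sd = 0.3))
metrics(est, beta_true = c(-0.5, 1))
```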
In
Table 1, we report the average squared biases and variances for the five estimators (E1 to E5) under the different scenarios, sample sizes
K, and aggregation sizes $n_g$. As we expected, there was a relatively large bias for the naive estimator E2, which used an approximate likelihood by conducting logistic regressions using the average
X. Our estimators (E3 to E5) had relatively small biases because these estimators were working on the correct and exact likelihood functions. The first estimator E1 had the smallest bias by working on individual-level
X and individual-level
Y. This estimator is widely used when individual-level
Y is available. However, under the scenario we intended to solve, only aggregate-level
Y was available. Thus, the E1 estimator is infeasible. We still report the performance of E1 to provide some measurements of the possible upper bound of the performance. Because data aggregation will discard information, we expect that estimator E1 will generally perform better than the estimators based on aggregate
Y.
Next, we examine the variances of all five estimators. The variances of all five estimators were of a similar magnitude, and no estimator performed uniformly, or even generally, better than the others. The naive estimator E2 had a similar, or even slightly better, average variance compared with the other estimators (E1, E3–E5). Our estimators (E3 to E5) were slightly worse in terms of variance. We think the slightly worse performance of our estimators (E3–E5) was likely due to the nonlinear optimization used to find the MLE. In comparison, the logistic regression estimators (E1 and E2) were calculated using iteratively reweighted least squares (IRLS) (the logistic regression log-likelihood is globally concave, so finding the MLE is simpler), which is numerically more stable than the nonlinear optimization of a general likelihood function using (1) Nelder and Mead’s simplex method [14], (2) the BFGS method [15], and (3) the conjugate gradient (CG) method [16].
We point out that although the naive estimator E2 worked with an incorrect (or approximate) likelihood function, which can lead to a large bias, its variance did not necessarily become worse. A similar phenomenon is under-fitting in data analysis: suppose the true relationship is a quadratic function; if a linear function is used in model fitting, the estimator will have a large bias due to model mis-specification, whereas the variance may not increase. The main disadvantage of estimator E2 is the use of an incorrect or approximate likelihood function, which can lead to a large bias. Using the correct exact likelihood, i.e., our estimators (E3 to E5), removes this bias at the cost of a slight increase in variance caused by switching from IRLS to nonlinear optimization with the Nelder–Mead simplex, BFGS, and CG methods. Comparing the decrease in bias with the increase in variance, we expect the bias reduction to dominate the variance increase for our estimators. We calculated the overall performance in terms of the MSE and MAD to confirm this.
Our simulation results showed that the naive estimator had a large bias due to the use of an incorrect or approximate likelihood function, which can hurt the MSE. Thus, in
Table 2, we report the average performance of the five estimators (E1 to E5) in terms of the MSE and MAD. Our simulation results indicated that our proposed estimators (E3 to E5) were better than the naive LR estimator (E2). As expected, the infeasible estimator (E1) based on individual-level
Y performed better than the other four feasible estimators (E2 to E5) based on aggregate-level
Y due to the loss of information in the data aggregation. Our estimator based on Nelder and Mead’s simplex optimization (E3) performed slightly better than our estimator based on BFGS optimization (E4) and CG optimization (E5).
We found that the performances of our estimators (E3, E4, E5) were slightly worse for the larger group size than for the smaller group size. We expect that the performance of our estimators may decrease for a large group size due to rounding errors in computation.
4. Real Data-Based Studies
We used real data to illustrate the use of our estimators and compare the different estimators. The dataset used was the “Social-Network-Ads” dataset from the Kaggle Machine Learning Forum (
https://www.kaggle.com, accessed on 12 January 2023).
The dataset has been used by statisticians and data scientists to illustrate the use of logistic regression in categorical data analysis. We used the dataset to illustrate the use of our method to conduct logistic regression in the presence of data aggregation.
The Social-Network-Ads dataset in Kaggle is a categorical dataset for determining whether a user purchased a particular product. The dataset (
https://www.kaggle.com/datasets, accessed on 12 January 2023) contains 400 people/observations. For each person, the purchase action (a binary variable, where 1 denotes purchased and 0 denotes not purchased), age, and estimated salary are provided. Logistic regression has been recommended in Kaggle to model the person’s purchase action based on the person’s age and estimated salary. We intend to apply our method to this dataset in the presence of data aggregation.
The original dataset is at the individual level, which allows us to conduct logistic regression based on individual-level Y and X. We standardized X in data pre-processing by subtracting the sample mean and dividing by the sample standard deviation of each predictor, i.e., $x_{ij}^{\mathrm{std}} = (x_{ij} - \bar{x}_j)/s_j$. Standardization of X allows for better estimation and interpretation. Standardized coefficients $\boldsymbol{\beta}^{\mathrm{std}}$ are obtained by logistic regression of Y on the standardized data $x_{ij}^{\mathrm{std}}$. The original slope coefficients can be recovered by the formula $\beta_j = \beta_j^{\mathrm{std}}/s_j$, and then the intercept coefficient can be calculated as $\beta_0 = \beta_0^{\mathrm{std}} - \sum_j \beta_j^{\mathrm{std}} \bar{x}_j / s_j$.
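The following R sketch illustrates this standardization and back-transformation on hypothetical data with the same structure (Age and EstimatedSalary as predictors, Purchased as the binary response); the variable names and simulated values are ours, not the paper's pre-processing code.

```r
set.seed(5)
dat <- data.frame(Age = rnorm(400, 38, 10),
                  EstimatedSalary = rnorm(400, 70000, 30000))
dat$Purchased <- rbinom(400, 1, plogis(-1 + 0.08 * (dat$Age - 38)))

# Standardize the predictors
x_mean <- colMeans(dat[, c("Age", "EstimatedSalary")])
x_sd   <- apply(dat[, c("Age", "EstimatedSalary")], 2, sd)
dat_std <- dat
dat_std[, c("Age", "EstimatedSalary")] <- scale(dat[, c("Age", "EstimatedSalary")])

# Logistic regression on standardized predictors
fit_std  <- glm(Purchased ~ Age + EstimatedSalary, family = binomial, data = dat_std)
beta_std <- coef(fit_std)

# Back-transform to the original scale
slope_orig     <- beta_std[-1] / x_sd
intercept_orig <- beta_std[1] - sum(beta_std[-1] * x_mean / x_sd)
c(intercept_orig, slope_orig)
```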
We imposed data aggregation on this dataset with a fixed aggregation size. We randomly divided the persons into groups of equal size and calculated the group aggregate of the purchase actions Y. Due to confidentiality and the cost of collecting individual-level data, businesses and organizations may choose to release data at an aggregate level. We mimicked this data aggregation process by random grouping and calculated the aggregate-level Y from the individual-level Y. We repeated the data aggregation 300 times. In this way, we generated 300 datasets, each with individual-level X and aggregate-level Y.
For each dataset, we conducted logistic regression based on individual-level X and Y and obtained our estimator E1. Since data aggregation discards information, we evaluated the other estimators by checking whether they were close to estimator E1. Because the true values of the coefficients in individual-level logistic regression models are not known in real data-based studies, we used estimator E1 as a gold-standard estimator. We compared the other estimators based on aggregate-level Y to determine which estimator was closer to our gold-standard estimator E1. Note that E1 is an infeasible estimator when individual-level Y is not available.
The estimator E1 was calculated based on individual-level X and individual-level Y. The estimated value of estimator E1 remained the same across our 300 generated datasets and E1 was treated as the gold-standard estimator; thus, we denote it as $\hat{\boldsymbol{\beta}}^{\mathrm{gold}} = (\hat{\beta}_0^{\mathrm{gold}}, \hat{\beta}_1^{\mathrm{gold}}, \hat{\beta}_2^{\mathrm{gold}})$.
Denote the estimated value of $\beta_l$ ($l = 0, 1, 2$) for the j-th estimator in the k-th dataset by $\hat{\beta}_{jl}^{(k)}$. The bias, variance, MSE, and MAD of estimators E2 to E5 for $\beta_0$, $\beta_1$, and $\beta_2$ were calculated by the formulae
$$ \mathrm{Bias}_{jl} = \frac{1}{300}\sum_{k=1}^{300}\left(\hat{\beta}_{jl}^{(k)} - \hat{\beta}_l^{\mathrm{gold}}\right), \qquad \mathrm{Var}_{jl} = \frac{1}{300}\sum_{k=1}^{300}\left(\hat{\beta}_{jl}^{(k)} - \frac{1}{300}\sum_{k'=1}^{300}\hat{\beta}_{jl}^{(k')}\right)^{2}, $$
$$ \mathrm{MSE}_{jl} = \frac{1}{300}\sum_{k=1}^{300}\left(\hat{\beta}_{jl}^{(k)} - \hat{\beta}_l^{\mathrm{gold}}\right)^{2}, \qquad \mathrm{MAD}_{jl} = \frac{1}{300}\sum_{k=1}^{300}\left|\hat{\beta}_{jl}^{(k)} - \hat{\beta}_l^{\mathrm{gold}}\right|. $$
For the four estimators based on aggregate-level Y and individual-level X, i.e., E2 to E5, we report the biases and variances in
Table 3. We can see that, in most cases, the naive estimator E2 has large biases in estimating two of the three coefficients and a relatively smaller bias in estimating the remaining one. Our proposed estimators (E3 to E5) always achieved smaller biases than the naive estimator E2. This is because the naive estimator E2 uses an approximate likelihood instead of the exact likelihood on which our proposed estimators are based. In terms of variance, the naive estimator had a relatively smaller variance compared with our estimators E3 to E5. We point out that the calculation algorithm used in E2, i.e., iteratively reweighted least squares (IRLS), is numerically more stable than the nonlinear optimization algorithms adopted by our estimators, i.e., Nelder and Mead’s simplex method, the BFGS method, and the conjugate gradient method.
We then checked the overall performance of the different estimators and report the MSE and MAD in
Table 4. We found that our estimators (E3 to E5) had better performance than the naive estimator (E2) in terms of the MSE and MAD in all situations based on the Social-Network-Ads dataset.
5. Discussion
Our estimators are obtained by maximizing the nonlinear likelihood function $L(\boldsymbol{\beta}) = \prod_{g=1}^{K} L_g(\boldsymbol{\beta})$. Different optimization methods can influence the performance of our estimators. Further studies can be conducted on other optimization methods, such as the genetic algorithm, or on using multiple starting values. The performance of the optimization is expected to decrease as p increases.
We only consider independent individual-level data, i.e., $(\mathbf{x}_i, y_i)$, $i = 1, \ldots, n$, are independent across individuals. The n observations are randomly divided into groups of size $n_g$ and the aggregate-level Y is calculated after grouping. In this paper, we only consider the situation of “grouping completely at random”, which means that the grouping mechanism is completely random: the values of X and Y do not influence the grouping. Further studies can be conducted beyond this type of grouping mechanism.
Our aggregation scheme is based on independent individual-level data. There are other aggregation schemes. For example, temporal aggregation aggregates dependent data; it can generate an aggregated low-frequency time series from a high-frequency time series by summing every m consecutive time points. For example, we can aggregate a daily time series into a weekly time series by summing every $m = 7$ consecutive daily observations. Temporal aggregation is often based on a time series model such as an integer-valued generalized autoregressive conditional heteroskedasticity (INGARCH) model [18].
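As a simple illustration of this kind of temporal aggregation (not part of the proposed method), the following R sketch sums every 7 consecutive daily counts into weekly counts, using hypothetical simulated data.

```r
set.seed(6)
daily  <- rpois(70, lambda = 3)                  # 10 weeks of daily counts
weekly <- colSums(matrix(daily, nrow = 7))       # sum every 7 consecutive days
weekly
```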
We note that the proposed methods also allow for link functions other than the logit link. For example, when a probit link function is used, we can estimate individual-level probit models based on aggregate-level Y and individual-level X. In addition, we only consider binary responses in this paper. A follow-up study to extend our methods to handle responses with more than two levels is under development.
6. Conclusions
We proposed methods to estimate logistic models based on individual-level predictors and aggregate-level responses. We conducted simulation studies to evaluate the performance of the estimators and show the advantages of our estimators. We then used the Social-Network-Ads dataset to illustrate the use of our estimators in the presence of data aggregation and compared the different estimators. Both the simulation studies and the real data-based studies have shown the advantage of our estimators in estimating logistic models describing individual-level behaviors based on aggregate-level Y and individual-level X, i.e., when there is data aggregation in the response variable.