2.1. Generalities
Let us consider the estimation of a linear regression when the individuals’ true coordinates are not disclosed for confidentiality reasons and a distance from a relevant point is used as a predictor. For instance, in health economics it is common practice to postulate a relationship between a health outcome for each individual (such as the effect of a health policy), say $y$, and the individual’s distance (say $d$) from a clinic or a hospital. To illustrate the essence of the problem we restrict ourselves to the admittedly unrealistic case of a simple regression without any further predictors. Furthermore, to simplify the algebra without compromising the generality of the results, we postulate a relationship between the health outcome and the squared distance from a relevant point. When data are geo-masked, the distance between each individual and a conspicuous point will be upward biased (as shown, e.g., by Arbia et al., 2015 [5] and Elkies et al., 2015 [6]). This paper shows that, when the masking procedure is disclosed, this information can be taken into account for the benefit of the analysis.
To start with, let us recall that the classical measurement error theory (e.g., [7]) defines the true model as:
$$y_{ij} = \alpha + \beta d_{ij}^{2} + v_{ij} \tag{1}$$
for each individual observed at the point of coordinates $(i, j)$, with $d_{ij}$ the distance between point $(i, j)$ and the point of interest and $v_{ij}$ a zero-mean disturbance with constant variance $\sigma_v^2$. The distance is observed with an error due to geo-masking and the measurement error is defined as:
$$u_{ij} = d_{ij}^{*2} - d_{ij}^{2}$$
with $d_{ij}^{*2}$ the squared distance observed after geo-masking. Following the classical theory, $u_{ij}$ is assumed to be such that $E(u_{ij}) = 0$, $Var(u_{ij}) = \sigma_u^2$ constant, and $u_{ij}$ independent of $v_{ij}$ and of $d_{ij}^{2}$. Normality of $u$ is also often assumed. In these conditions, having called $\hat{\beta}$
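the OLS estimator of $\beta$, this estimator will be less efficient, since its variance can now be expressed as:
$$Var(\hat{\beta}) = \frac{\sigma_v^2 + \beta^2 \sigma_u^2}{\sum_{i,j} \left(d_{ij}^{*2} - \overline{d^{*2}}\right)^2} \tag{2}$$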
The estimator will also be inconsistent, with an asymptotic bias towards zero (called attenuation) quantified by the expression [7]:
$$\operatorname{plim} \hat{\beta} = \frac{\beta}{1 + \sigma_u^2 / \sigma_d^2} \tag{3}$$
with $\sigma_d^2$ the variance of the squared uncontaminated distances.
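The classical attenuation result can be checked numerically. The following is a minimal sketch, not part of the original analysis: it uses a generic Gaussian regressor `x` in place of the squared distance, and the parameter values (`alpha`, `beta`, `sd_x`, `sd_u`, `sd_v`) are hypothetical choices made only for illustration. Under the classical assumptions the slope estimated on the contaminated regressor should be close to $\beta / (1 + \sigma_u^2 / \sigma_x^2)$.

```python
import random


def ols_slope(x, y):
    """Slope of the OLS regression of y on x (intercept included)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    return sxy / sxx


rng = random.Random(0)
n = 100_000
alpha, beta = 1.0, 2.0             # hypothetical true parameters
sd_x, sd_u, sd_v = 2.0, 1.0, 1.0   # hypothetical standard deviations

x_true = [rng.gauss(0.0, sd_x) for _ in range(n)]
y = [alpha + beta * x + rng.gauss(0.0, sd_v) for x in x_true]
# Classical measurement error: zero mean, constant variance, independent of x and v.
x_obs = [x + rng.gauss(0.0, sd_u) for x in x_true]

slope_true = ols_slope(x_true, y)
slope_obs = ols_slope(x_obs, y)
predicted = beta / (1.0 + sd_u**2 / sd_x**2)

print(f"slope on true regressor:     {slope_true:.3f}")
print(f"slope on observed regressor: {slope_obs:.3f}")
print(f"attenuation formula:         {predicted:.3f}")
```

With these values the attenuation factor is $1 / (1 + 1/4) = 0.8$, so the slope on the contaminated regressor should settle near 1.6 rather than 2.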
However, when the measurement error is induced by geo-masking, the results are quite different from the classical ones, as we will illustrate in the next sections.
2.2. Gaussian Geo-Masking within a Circle
Let us start by considering the effect of geo-masking when a distance is used as a regressor in an econometric model, in the case of Gaussian geo-masking, that is, when the true location of each individual is perturbed with a bivariate Gaussian distribution centered on the true point. More formally, let us consider the point of coordinates $(i, j)$ and let us geo-mask this point by disclosing, instead, the coordinates $i^* = i + \varepsilon_1$ and $j^* = j + \varepsilon_2$, with $\varepsilon_1, \varepsilon_2 \sim \text{i.i.d. } N(0, \sigma^2)$. Let us further consider, for each individual point, its distance from a conspicuous point that, without loss of generality, we can place at the origin of the Cartesian system. In this case the true squared distance of the point of coordinates $(i, j)$ from the conspicuous point is given by:
$$d_{ij}^{2} = i^2 + j^2$$
while, after geo-masking, we observe instead
$$d_{ij}^{*2} = (i + \varepsilon_1)^2 + (j + \varepsilon_2)^2 = d_{ij}^{2} + 2 i \varepsilon_1 + 2 j \varepsilon_2 + \varepsilon_1^2 + \varepsilon_2^2 = d_{ij}^{2} + u_{ij} \tag{4}$$
because of our definitions.
So the term $u_{ij} = 2 i \varepsilon_1 + 2 j \varepsilon_2 + \varepsilon_1^2 + \varepsilon_2^2$ defines the measurement error on the independent variable of the model, as in the classical theory. However, in contrast with the classical theory, in this case we have:
$$E(u_{ij}) = 2\sigma^2 \tag{5}$$
and
$$Var(u_{ij}) = 4\sigma^2 \left(d_{ij}^{2} + \sigma^2\right) \tag{6}$$
Thus the measurement error has a non-zero mean and a non-constant variance (see Appendix A for the proof). The non-zero mean does not affect the point estimate of the parameter $\beta$, but only the constant term. Equation (6) shows that the procedure of geo-masking also induces heteroscedasticity.
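The two moments of the Gaussian measurement error can be verified by simulation. The sketch below is only illustrative: the point $(3, 4)$, the value $\sigma = 1.5$ and the helper name `gaussian_error_moments` are hypothetical. It compares the simulated mean and variance of $u$ with the closed forms $2\sigma^2$ and $4\sigma^2(d^2 + \sigma^2)$ implied by independent $N(0, \sigma^2)$ perturbations of the two coordinates.

```python
import random


def gaussian_error_moments(i, j, sigma, n=200_000, seed=0):
    """Monte Carlo mean and variance of u = d*^2 - d^2 under Gaussian geo-masking."""
    rng = random.Random(seed)
    d2 = i * i + j * j
    total = 0.0
    total_sq = 0.0
    for _ in range(n):
        e1 = rng.gauss(0.0, sigma)
        e2 = rng.gauss(0.0, sigma)
        u = (i + e1) ** 2 + (j + e2) ** 2 - d2
        total += u
        total_sq += u * u
    mean = total / n
    return mean, total_sq / n - mean * mean


i, j, sigma = 3.0, 4.0, 1.5          # hypothetical location and masking scale
mean_u, var_u = gaussian_error_moments(i, j, sigma)
theory_mean = 2 * sigma**2
theory_var = 4 * sigma**2 * (i**2 + j**2 + sigma**2)

print(f"mean of u:     simulated {mean_u:.3f}, closed form {theory_mean:.3f}")
print(f"variance of u: simulated {var_u:.2f}, closed form {theory_var:.2f}")
```

Repeating the run for points farther from the origin makes the heteroscedasticity visible: the mean stays at $2\sigma^2$ while the variance grows linearly in $d^2$.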
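Furthermore, from Equation (4) we have:
$$\frac{d_{ij}^{*2}}{\sigma^2} = \left(\frac{i + \varepsilon_1}{\sigma}\right)^2 + \left(\frac{j + \varepsilon_2}{\sigma}\right)^2 \tag{7}$$
As a consequence (since $\sigma$, $i$ and $j$ are constant terms and $\varepsilon_1, \varepsilon_2 \sim N(0, \sigma^2)$), $d_{ij}^{*2} / \sigma^2$ is a non-central Chi-squared with 2 degrees of freedom and non-centrality parameter $\lambda = (i^2 + j^2) / \sigma^2$. The proof is left to Appendix B.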
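The distributional claim can also be checked empirically. The sketch below rests on stated assumptions rather than on the original proof: it evaluates the non-central Chi-squared distribution function through its standard Poisson-mixture representation (a non-central Chi-squared with 2 degrees of freedom and non-centrality $\lambda$ is a central Chi-squared with $2 + 2K$ degrees of freedom, with $K \sim \text{Poisson}(\lambda/2)$) and compares it with the simulated distribution of $d^{*2}/\sigma^2$. The function names, the point $(3, 4)$, $\sigma = 2$ and the evaluation point $x = 8$ are hypothetical.

```python
import math
import random


def ncx2_cdf_2df(x, lam, terms=200):
    """CDF at x of a non-central chi-squared with 2 df and non-centrality lam."""
    half_x, half_lam = x / 2.0, lam / 2.0
    poisson_w = math.exp(-half_lam)   # P(K = 0)
    term = math.exp(-half_x)          # e^{-x/2} (x/2)^m / m! for m = 0
    survival = term                   # P(central chi-squared with 2 df > x)
    total = 0.0
    for k in range(terms):
        # Central chi-squared with 2 + 2k df has survival sum_{m<=k} e^{-x/2}(x/2)^m/m!
        total += poisson_w * (1.0 - survival)
        poisson_w *= half_lam / (k + 1)
        term *= half_x / (k + 1)
        survival += term
    return total


def empirical_cdf(x, i, j, sigma, n=200_000, seed=0):
    """Share of simulated values of d*^2 / sigma^2 that do not exceed x."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n):
        e1 = rng.gauss(0.0, sigma)
        e2 = rng.gauss(0.0, sigma)
        q = ((i + e1) ** 2 + (j + e2) ** 2) / sigma**2
        hits += q <= x
    return hits / n


i, j, sigma = 3.0, 4.0, 2.0            # hypothetical location and masking scale
lam = (i**2 + j**2) / sigma**2         # non-centrality parameter
x = 8.0
theory = ncx2_cdf_2df(x, lam)
empirical = empirical_cdf(x, i, j, sigma)

print(f"P(d*^2 / sigma^2 <= {x}): mixture formula {theory:.4f}, simulation {empirical:.4f}")
```

The two probabilities should agree up to Monte Carlo noise for any choice of $x$.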
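Following the classical theory, and recalling Equations (2) and (3), the OLS estimator will be less efficient and inconsistent. In particular, in the case of Gaussian geo-masking, the variance of the OLS estimator will be:
$$Var(\hat{\beta}) = \frac{\sigma_v^2 + 4 \beta^2 \sigma^2 \left(d_{ij}^{2} + \sigma^2\right)}{\sum_{i,j} \left(d_{ij}^{*2} - \overline{d^{*2}}\right)^2} \tag{8}$$
Thus, the larger the geo-masking variance $\sigma^2$ and the larger the distance from the conspicuous point, the lower the precision of the estimate will be. The precision also depends on the square of the true value of $\beta$. Furthermore, using Equation (6) to evaluate the attenuation, we have:
$$\operatorname{plim} \hat{\beta} = \frac{\beta}{1 + 4 \sigma^2 \left(d_{ij}^{2} + \sigma^2\right) / \sigma_d^2} \tag{9}$$
which shows that the attenuation effect on the OLS estimator is greater for a larger geo-masking variance and at greater distances.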
In practical cases, to communicate with practitioners, it is useful to express the Gaussian geo-masking mechanism in terms of a maximum displacement distance, which non-specialists find easier to interpret than a variance. Since in a Gaussian distribution $P(|\varepsilon| \le 3\sigma) \approx 0.997$, with a probability close to 1 we can assume that the maximum displacement distance is $3\sigma$. If we call $\theta^*$ such maximum distance, we have that $3\sigma = \theta^*$. So the expected measurement error is $E(u_{ij}) = \frac{2}{9} \theta^{*2}$ and the bias can be seen as a fraction of the maximum squared displacement distance. Furthermore, its variance can be expressed as $Var(u_{ij}) = \frac{4}{9} \theta^{*2} \left(d_{ij}^{2} + \frac{\theta^{*2}}{9}\right)$, which shows that uncertainty increases with the maximum displacement distance and with the distance of the individual from the conspicuous point. By using this alternative expression, the variance of the OLS estimator can be expressed as:
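$$Var(\hat{\beta}) = \frac{\sigma_v^2 + \frac{4}{9} \beta^2 \theta^{*2} \left(d_{ij}^{2} + \frac{\theta^{*2}}{9}\right)}{\sum_{i,j} \left(d_{ij}^{*2} - \overline{d^{*2}}\right)^2} \tag{10}$$
and the attenuation effect as:
$$\operatorname{plim} \hat{\beta} = \frac{\beta}{1 + \dfrac{4 \theta^{*2} \left(9 d_{ij}^{2} + \theta^{*2}\right)}{81 \sigma_d^2}} \tag{11}$$
which shows more intuitively the negative effects of geo-masking on the OLS estimates. The greater the maximum displacement distance is, the larger both the loss in efficiency and the attenuation effect will be.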
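To make the attenuation concrete, the following sketch simulates a regression of an outcome on the squared distance before and after Gaussian geo-masking, with the masking scale set through $\sigma = \theta^*/3$. Everything in it is an assumption made for illustration: the individuals are placed uniformly on a $20 \times 20$ square centered on the conspicuous point, the parameters `alpha`, `beta` and the unit disturbance variance are arbitrary, and the benchmark `predicted` averages the error variance $4\sigma^2(d^2 + \sigma^2)$ over the sampled locations.

```python
import random


def ols_slope(x, y):
    """Slope of the OLS regression of y on x (intercept included)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    return sxy / sxx


def simulate(theta_star, n=50_000, alpha=1.0, beta=0.5, seed=1):
    """Return (slope on true d^2, slope on masked d^2, averaged attenuation benchmark)."""
    rng = random.Random(seed)
    sigma = theta_star / 3.0          # maximum displacement read as 3 sigma
    d2, d2_masked, y = [], [], []
    for _ in range(n):
        i = rng.uniform(-10.0, 10.0)  # hypothetical spatial layout
        j = rng.uniform(-10.0, 10.0)
        true_d2 = i * i + j * j
        masked = (i + rng.gauss(0.0, sigma)) ** 2 + (j + rng.gauss(0.0, sigma)) ** 2
        d2.append(true_d2)
        d2_masked.append(masked)
        y.append(alpha + beta * true_d2 + rng.gauss(0.0, 1.0))
    mean_d2 = sum(d2) / n
    var_d2 = sum((v - mean_d2) ** 2 for v in d2) / n
    avg_var_u = 4 * sigma**2 * (mean_d2 + sigma**2)   # error variance averaged over locations
    predicted = beta / (1.0 + avg_var_u / var_d2)
    return ols_slope(d2, y), ols_slope(d2_masked, y), predicted


for theta_star in (3.0, 6.0):
    slope_true, slope_masked, predicted = simulate(theta_star)
    print(f"theta* = {theta_star}: true-distance slope {slope_true:.3f}, "
          f"masked slope {slope_masked:.3f}, benchmark {predicted:.3f}")
```

In this setup, doubling the maximum displacement distance visibly pulls the estimated slope further towards zero, while the slope estimated on the true distances stays close to `beta`.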
2.3. Uniform Geo-Masking within a Circle
Let us now turn to the effects of uniform geo-masking (such as the one employed, e.g., by DHS, 2013 [4]), that is, a mechanism which displaces the coordinates along a random angle (say $\delta$) and by a random distance (say $\theta$), both obeying a uniform probability law. The mechanism can be formally expressed through the following hypotheses:
HP1: $\theta \sim U(0, \theta^*)$ and $\delta \sim U(0, 2\pi)$, with $\theta^*$ the maximum distance error, and
HP2: $\theta$ and $\delta$ are independent.
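Assuming again, without loss of generality, that the conspicuous point is located at the origin, the true squared distance between the point of coordinates $(i, j)$ and the conspicuous point before geo-masking is measured by $d_{ij}^{2} = i^2 + j^2$, while, after geo-masking, it can be expressed, using polar coordinates, as:
$$d_{ij}^{*2} = (i + \theta \cos\delta)^2 + (j + \theta \sin\delta)^2 \tag{12}$$
Expanding Equation (12) we obtain:
$$d_{ij}^{*2} = i^2 + j^2 + \theta^2 + 2\theta \left(i \cos\delta + j \sin\delta\right) \tag{13}$$
so that we can now express the measurement error as:
$$u_{ij} = d_{ij}^{*2} - d_{ij}^{2} = \theta^2 + 2\theta \left(i \cos\delta + j \sin\delta\right) \tag{14}$$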
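Similarly to the Gaussian case, we have a non-zero mean and a non-constant variance, given by:
$$E(u_{ij}) = \frac{\theta^{*2}}{3} \tag{15}$$
and
$$Var(u_{ij}) = \frac{2}{3} \theta^{*2} d_{ij}^{2} + \frac{4}{45} \theta^{*4} \tag{16}$$
Again the proofs are left to the appendices, specifically to Appendix C.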
So, consistently with the results obtained with Gaussian geo-masking and in accordance with intuition, the variance of the measurement error increases as the maximum displacement distance θ* increases and as we move away from the conspicuous point.
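The moments of the measurement error under uniform geo-masking can be checked in the same spirit. The sketch below is illustrative only: the point $(3, 4)$, the value $\theta^* = 3$ and the helper name `uniform_error_moments` are hypothetical, and the angle is drawn in radians on $(0, 2\pi)$. It compares the simulated mean and variance of $u = \theta^2 + 2\theta(i\cos\delta + j\sin\delta)$ with the closed forms $\theta^{*2}/3$ and $\frac{2}{3}\theta^{*2} d^2 + \frac{4}{45}\theta^{*4}$ that follow from HP1 and HP2.

```python
import math
import random


def uniform_error_moments(i, j, theta_star, n=200_000, seed=0):
    """Monte Carlo mean and variance of u = d*^2 - d^2 under uniform geo-masking."""
    rng = random.Random(seed)
    d2 = i * i + j * j
    total = 0.0
    total_sq = 0.0
    for _ in range(n):
        theta = rng.uniform(0.0, theta_star)       # displacement distance
        delta = rng.uniform(0.0, 2.0 * math.pi)    # displacement angle, independent of theta
        masked = (i + theta * math.cos(delta)) ** 2 + (j + theta * math.sin(delta)) ** 2
        u = masked - d2
        total += u
        total_sq += u * u
    mean = total / n
    return mean, total_sq / n - mean * mean


i, j, theta_star = 3.0, 4.0, 3.0     # hypothetical location and maximum displacement
mean_u, var_u = uniform_error_moments(i, j, theta_star)
theory_mean = theta_star**2 / 3.0
theory_var = (2.0 / 3.0) * theta_star**2 * (i**2 + j**2) + (4.0 / 45.0) * theta_star**4

print(f"mean of u:     simulated {mean_u:.3f}, closed form {theory_mean:.3f}")
print(f"variance of u: simulated {var_u:.2f}, closed form {theory_var:.2f}")
```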
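If we use this result again to provide explicit expressions for the estimation variance and for the attenuation effect, we have, respectively:
$$Var(\hat{\beta}) = \frac{\sigma_v^2 + \beta^2 \left(\frac{2}{3} \theta^{*2} d_{ij}^{2} + \frac{4}{45} \theta^{*4}\right)}{\sum_{i,j} \left(d_{ij}^{*2} - \overline{d^{*2}}\right)^2} \tag{17}$$
and
$$\operatorname{plim} \hat{\beta} = \frac{\beta}{1 + \left(\frac{2}{3} \theta^{*2} d_{ij}^{2} + \frac{4}{45} \theta^{*4}\right) / \sigma_d^2} \tag{18}$$
which lead to conclusions very similar to those found for Gaussian geo-masking (see Equations (10) and (11)). The greater the maximum displacement distance is, the lower the precision and the larger the attenuation effect are.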
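The two mechanisms can be compared directly for a common maximum displacement distance. The sketch below is a small calculator under stated assumptions: the function names are hypothetical, the Gaussian case uses $\sigma = \theta^*/3$, and the attenuation factor is the classical ratio $1 / (1 + \sigma_u^2 / \sigma_d^2)$ evaluated at a single squared distance `d2` and an arbitrary value of `var_d2`.

```python
import math


def var_u_gaussian(d2, theta_star):
    """Error variance under Gaussian geo-masking with sigma = theta_star / 3."""
    sigma2 = (theta_star / 3.0) ** 2
    return 4.0 * sigma2 * (d2 + sigma2)


def var_u_uniform(d2, theta_star):
    """Error variance under uniform geo-masking with maximum displacement theta_star."""
    return (2.0 / 3.0) * theta_star**2 * d2 + (4.0 / 45.0) * theta_star**4


def attenuation_factor(var_u, var_d2):
    """Share of beta retained asymptotically, 1 / (1 + var_u / var_d2)."""
    return 1.0 / (1.0 + var_u / var_d2)


d2, var_d2 = 25.0, 1000.0            # hypothetical squared distance and variance of d^2
for theta_star in (1.0, 3.0, 6.0):
    vg = var_u_gaussian(d2, theta_star)
    vu = var_u_uniform(d2, theta_star)
    print(f"theta* = {theta_star}: Var(u) Gaussian {vg:.2f} "
          f"(factor {attenuation_factor(vg, var_d2):.3f}), "
          f"uniform {vu:.2f} (factor {attenuation_factor(vu, var_d2):.3f})")
```

For the same $\theta^*$ the uniform mechanism yields the larger error variance in this parameterization, since the Gaussian displacement concentrates most of its mass well inside the $3\sigma$ radius.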