1. Introduction
A large sample is not always representative of the population; it may instead be biased, a situation referred to as big-but-biased data (B3D). Some of the problems that arise from ignoring sampling bias in the statistical analysis of big data have recently been reported by Cao [1]. A good example, cited by Crawford [2], is the data collected in the city of Boston through the StreetBump smartphone app, which underestimate the number of potholes in some neighborhoods of the city and consequently lead to deficient management of resources. Another example is the database of more than 20 million tweets generated by Hurricane Sandy. These data come from a biased sample of the population, since most of the tweets came from Manhattan, while few originated in the areas most affected by the catastrophe. In other examples, such as those cited in Hargittai [3], survey data show that the use of social network sites is biased, yielding samples that limit the generalizability of findings.
In this context, let us consider a population with CDF $F$ (density $f$) and an SRS, $X_1, \dots, X_n$, of size $n$ from this population. Assume that we are not able to observe this sample but observe, instead, another sample, $Y_1, \dots, Y_N$, of a much larger size ($N \gg n$) from a biased distribution $G$ (density $g$), such that $g(x) = w(x) f(x)$ for some weight function $w$, with $w(x) \ge 0$.
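To make the effect of the bias concrete, the following minimal sketch simulates such a setting. Everything specific in it is an assumption chosen for illustration and is not taken from the paper: a normal population, an exponential-tilt weight function `w`, the sample sizes, and the relation g = w f. It compares the naive mean of the large biased sample, the mean of a small SRS, and an idealized reweighted mean that would only be available if `w` were known.

```python
import numpy as np

rng = np.random.default_rng(0)

mu, theta = 2.0, 1.0      # hypothetical population mean and bias strength
n, N = 50, 100_000        # hypothetical sizes: small SRS, large biased sample


def w(x):
    """Hypothetical weight function: with f = N(mu, 1), g = w * f is N(mu + theta, 1)."""
    return np.exp(theta * (x - mu) - theta**2 / 2)


x_srs = rng.normal(mu, 1.0, size=n)            # draws from f
y_big = rng.normal(mu + theta, 1.0, size=N)    # draws from the biased density g

naive_mean = y_big.mean()                  # targets mu + theta, not mu
srs_mean = x_srs.mean()                    # unbiased for mu, but noisy for small n
known_w_mean = np.mean(y_big / w(y_big))   # dividing by w undoes the bias (w known)

print(f"naive mean of the big sample: {naive_mean:.3f}")
print(f"mean of the small SRS:        {srs_mean:.3f}")
print(f"reweighted mean (known w):    {known_w_mean:.3f}")
```

In this toy setting the naive mean concentrates tightly around the wrong value, which is the sense in which a very large sample does not compensate for sampling bias.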
2. Mean Estimation in B3D
To deal with the mean estimation problem in this context, we propose the realistic estimator (unknown w case) whose motivation is explained by Cao and Borrajo [4]:
In order to work with this estimator, extra information is required. We propose a scenario in which, in addition to the biased sample, we also observe an SRS of small size from the real population. The Parzen-Rosenblatt KDE (see [5,6]), computed from the SRS and from the biased sample, can be used to estimate f and g, respectively.
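As a rough illustration of how the two kernel density estimates can be combined, the sketch below assumes a plug-in form in which each biased observation Y_i is reweighted by the ratio of the estimated densities, averaging Y_i f̂_h(Y_i) / ĝ_b(Y_i). This form, the Gaussian kernel, the function names `gaussian_kde` and `plugin_mean`, and the simulated data are all assumptions made for illustration; they are not a transcription of estimator (1). Here h is the bandwidth used with the SRS and b the bandwidth used with the biased sample.

```python
import numpy as np


def gaussian_kde(points, data, bandwidth):
    """Parzen-Rosenblatt density estimate with a Gaussian kernel, evaluated at `points`."""
    u = (points[:, None] - data[None, :]) / bandwidth
    kernel = np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)
    return kernel.sum(axis=1) / (len(data) * bandwidth)


def plugin_mean(x_srs, y_big, h, b):
    """Assumed plug-in form: mean of Y_i * f_hat_h(Y_i) / g_hat_b(Y_i)."""
    f_hat = gaussian_kde(y_big, x_srs, h)   # f estimated from the small SRS
    g_hat = gaussian_kde(y_big, y_big, b)   # g estimated from the biased sample
    return float(np.mean(y_big * f_hat / g_hat))


# Hypothetical data: population N(2, 1), biased sample shifted to N(3, 1).
rng = np.random.default_rng(1)
mu, theta = 2.0, 1.0
n, N = 100, 2000
x_srs = rng.normal(mu, 1.0, size=n)
y_big = rng.normal(mu + theta, 1.0, size=N)

print(f"SRS mean:          {x_srs.mean():.3f}")
print(f"biased-sample mean: {y_big.mean():.3f}")
for h, b in [(0.3, 0.2), (0.5, 0.3), (0.8, 0.5)]:
    est = plugin_mean(x_srs, y_big, h, b)
    print(f"h={h}, b={b}: plug-in mean = {est:.3f}")
```

Because ĝ_b is evaluated at the biased observations themselves, each denominator includes its own kernel term and is strictly positive. The pairwise distance matrices make this sketch quadratic in N, so it is only suitable for moderate sample sizes.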
The final expression of the AMSE of (1) is:
3. Case Study with Simulated Data
Let us consider the setting shown in Figure 1a:
Figure 1b shows that, for a large number of combinations of the bandwidths h and b, the proposed estimator improves on the estimates obtained from the SRS alone and from the biased sample alone. Looking at Table 1, we observe that the best choice of h and b in the simulation study contradicts the assumptions used in obtaining the asymptotic results. The AMSE of (1) under these non-standard asymptotic conditions is:
4. Conclusions
Big Data brings new statistical challenges, since bias is much more prevalent. Ideas from length-biased data and nonparametric smoothing techniques are important in this context. Testing for bias is a relevant problem in Big Data, and smoothing parameter selection may be paradoxical in B3D.
Funding
This research has been supported by MINECO Grants MTM2014-52876-R and MTM2017-82724-R and by the Xunta de Galicia (Grupos de Referencia Competitiva ED431C-2016-015 and Centro Singular de Investigación de Galicia ED431G/01), all of them through the European Regional Development Fund (ERDF). The second author’s research was sponsored by the Xunta de Galicia predoctoral grant (with reference ED481A-2016/367) for the universities of the Galician University System, public research organizations in Galicia and other entities of the Galician R&D&I System, which is funded 80% by the European Social Fund (ESF) and the remaining 20% by the General Secretary of Universities, belonging to the Ministry of Culture, Education and University Management of the Xunta de Galicia.
Conflicts of Interest
The authors declare no conflict of interest. The funding sponsors had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.
Abbreviations
The following abbreviations are used in this manuscript:
AMSE | Asymptotic mean squared error
B3D | Big-but-biased Data (BBBD)
CDF | Cumulative distribution function
KDE | Kernel density estimator
MSE | Mean squared error
SRS | Simple random sample
References
- Cao, R. Inferencia estadística con datos de gran volumen. Gac. RSME 2015, 18, 393–417.
- Crawford, K. The hidden biases in big data. Harv. Bus. Rev. 2013. Available online: https://hbr.org/2013/04/the-hidden-biases-in-big-data (accessed on 4 April 2016).
- Hargittai, E. Is Bigger Always Better? Potential Biases of Big Data Derived from Social Network Sites. Ann. Am. Acad. Political Soc. Sci. 2015, 659, 63–76.
- Cao, R.; Borrajo, L. Nonparametric Mean Estimation for Big-But-Biased Data. In The Mathematics of the Uncertain, Studies in Systems, Decision and Control; Springer: Cham, Switzerland, 2018; Volume 142, pp. 55–65.
- Parzen, E. On Estimation of a Probability Density Function and Mode. Ann. Math. Stat. 1962, 33, 1065–1076.
- Rosenblatt, M. Remarks on some nonparametric estimates of a density function. Ann. Math. Stat. 1956, 27, 832–837.
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
© 2018 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).