1. Introduction
Precise numerical data are at the heart of classical statistics, and various researchers have developed estimators to calculate the mean of a finite population using auxiliary information. When there is high correlation between the study variable and auxiliary variable, using a ratio estimation method instead of only considering the study variable can significantly reduce the sampling error. This results in a smaller required sample size while maintaining precision, as noted by Cochran [
1]. Ratio estimation techniques have been extensively researched, with different types and uses developed over time. Researchers have explored various transformations of known parameters and statistics as auxiliary variables. Recent studies have shown that utilizing diverse types of auxiliary information can improve the performance of ratio-type estimators. For example, some scholars have suggested using exponential-type ratio estimators and refining their performance. Others have investigated the estimation of the mean through exponential ratio-type estimators in the presence of non-response. One study proposed an estimator that utilizes complete information, which outperforms exponential ratio-type estimators. Additionally, a study has examined the implementation of a ratio-type estimator for multivariate k-statistics and explored the use of auxiliary information with the coefficient of variation. These advancements were discussed by Robson [
2] in his research. Tahir et al. [
3] developed neutrosophic ratio estimators in simple random sampling. Vishwakarma and Singh [
4] extended this work to the ranked set sampling scheme. Yadav and Smarandache [
5] and Kumar et al. [
6] defined generalized families of neutrosophic ratio and exponential estimators.
While classical statistics assumes precise data, fuzzy logic provides a solution for data that may not have exact measurements. Fuzzy statistics is a useful tool for analyzing data with fuzzy, ambiguous, uncertain, or imprecise parameters or observations. However, it does not account for the degree of indeterminacy in the data. Neutrosophic logic is an extension of fuzzy logic that enables the measurement of both the determinate and indeterminate parts of the observations. It is used to analyze data with vague or uncertain observations, as noted by Smarandache [
7].
When data have some degree of indeterminacy, neutrosophic statistics are used. This statistical methodology goes beyond the traditional approach and is employed in situations where the sample or data contains neutrosophy. Neutrosophic statistics are particularly useful when observations within the population or sample are ambiguous, uncertain, and indefinite, as explained by Smarandache [
7].
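To make the interval idea concrete, a neutrosophic observation can be stored as a pair of bounds and summary statistics computed bound-wise. This is a minimal illustrative sketch; the numbers are made up and are not from any dataset in this article.

```python
import numpy as np

# Each neutrosophic observation is an interval [lower, upper]: the true
# value is assumed to lie somewhere inside it.  A neutrosophic sample mean
# is then itself an interval, computed bound-wise.
data = np.array([
    [10.2, 10.8],
    [11.0, 11.5],
    [9.7, 10.1],
])  # rows: [lower bound, upper bound]

mean_low = data[:, 0].mean()
mean_up = data[:, 1].mean()
print(f"neutrosophic sample mean: [{mean_low:.2f}, {mean_up:.2f}]")
```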
1.1. Neutrosophic Statistics and Hartley–Ross-Type Estimators
Neutrosophic statistical methods are utilized to examine datasets that contain some level of uncertainty, also known as neutrosophic data. In this type of statistics, the sample size may not be accurately determined, as explained by Smarandache [
8]. Smarandache’s research [
8] has shown that neutrosophic statistics are highly effective in analyzing systems of uncertainty. In the field of rock engineering, neutrosophic numbers have been used to investigate the scale effect and anisotropy of joint roughness coefficient. This has led to a more efficient method for overcoming information loss and generating adequately fitted functions, as demonstrated by Chen et al. [
9]. Additionally, a new technique called neutrosophic analysis of variance has been introduced for analyzing neutrosophic data. The field of neutrosophic statistics is currently being advanced by exploring new areas such as neutrosophic interval statistics (NIS), neutrosophic applied statistics (NAS), and neutrosophic statistical quality control (NSQC).
Hartley and Ross [
10], Robson [
2], Murty [
11], and Smoo et al. [
12] have studied various unbiased estimators for population mean. Hartley and Ross [
10] have devised new ratio-type estimators for estimating the population mean, and their work has been further enhanced by other survey statisticians. When the variables have a negative correlation, Singh and Singh [
13] have proposed unbiased Hartley–Ross estimators for the population mean. Additionally, Singh et al. [
14] have developed modified ratio-type estimators based on Hartley and Ross estimators, incorporating additional information such as coefficient of variation, correlation, and more. Kadilar and Cekim [
15] have drawn inspiration from Hartley–Ross’s work and developed the Hartley–Ross-type regression estimator that utilizes auxiliary information. In this article, we will investigate the application of neutrosophic OLS and robust regression coefficients in Hartley–Ross-type neutrosophic mean estimators.
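For readers unfamiliar with the construction, the classical (crisp) Hartley–Ross unbiased ratio-type estimator can be sketched as follows; the neutrosophic versions studied in this article apply analogous expressions to interval-valued data. The toy data in the usage line are invented for illustration.

```python
import numpy as np

def hartley_ross(y, x, X_bar, N):
    """Classical Hartley-Ross unbiased ratio-type estimator of the
    population mean of y, given the known population mean X_bar of x."""
    y, x = np.asarray(y, float), np.asarray(x, float)
    n = y.size
    r_bar = np.mean(y / x)                    # mean of unit ratios r_i = y_i / x_i
    correction = n * (N - 1) / (N * (n - 1))  # removes the bias of r_bar * X_bar
    return r_bar * X_bar + correction * (y.mean() - r_bar * x.mean())

# Toy usage: y is exactly proportional to x, so the estimate equals r * X_bar.
print(hartley_ross([2.0, 4.0, 6.0], [1.0, 2.0, 3.0], X_bar=2.0, N=10))  # -> 4.0
```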
1.2. Research Gap
Previous research in survey sampling has primarily focused on data that are precise, certain, and unambiguous. However, such methods generate a single, clear-cut result that may be inaccurate, overestimated, or underestimated, which poses a limitation in certain cases. Conversely, there are situations where data are of a neutrosophic nature, and in such cases classical statistical methods are inadequate. Data with a neutrosophic nature are often characterized by uncertain and ambiguous observations, non-clear arguments, and vague interval values. Therefore, data collected from experiments or populations can be expressed as interval-valued neutrosophic numbers (INN), where the observation is assumed to fall within the boundaries of the given interval. In reality, indeterminate data are more common than determinate data, making it necessary to develop further neutrosophic statistical techniques to analyze such data.
Gathering data on numerous variables throughout life can be a costly endeavor, especially when the data are uncertain. Therefore, relying on traditional classical methods to determine the unknown true value of the population for ambiguous data can be both risky and expensive. Furthermore, if both the primary study variable and auxiliary variables have a neutrosophic nature, conventional Hartley–Ross-type estimation is inadequate. As a result, this study suggests the use of neutrosophic Hartley–Ross-type regression estimators.
A comprehensive examination of published studies [3–6] reveals that no research has been conducted in the domain of survey sampling to estimate an unknown population mean using Hartley–Ross-type regression estimation methods in the presence of auxiliary variables under neutrosophic data containing both sensitive and non-sensitive observations. This specific field of statistics necessitates further investigation, and the present research serves as an introductory step in this area.
1.3. Scope of the Study
Neutrosophic statistical analysis is an approach that is capable of handling data with incomplete or indeterminate information, while also accommodating inconsistent beliefs. In some instances, when data are gathered using certain instruments, observations may exist within an uncertain range, with the possibility that the true measurement lies within that interval. In such cases of indeterminacy, classical statistical methods may not be adequate for analyzing the data. As an alternative, the method of neutrosophic statistics is utilized as a more flexible and general version of classical statistics. While numerous studies in the domain of survey sampling have been conducted under the framework of neutrosophy, the area of Hartley–Ross-type estimation remains a relatively new and underexplored field that warrants further attention.
To start, we assume that the neutrosophic study variable is non-sensitive, which makes the conventional estimation method appropriate. However, we also recognize the existence of sensitive variables that are highly personal, stigmatizing, or threatening, and they are likely to be observed in sampled units using non-standard survey techniques to enhance respondent cooperation. These techniques have been derived from the randomized response theory introduced by Warner [
16] and have been extensively discussed in the works of Fox and Tracy [
17], Chaudhuri and Mukerjee [
18], and Chaudhuri [
19]. Furthermore, we perform a numerical investigation on the sensitive variable to assess the performance of the proposed method by manipulating the values of the study variable using various randomized response models.
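As background on the randomized response idea, Warner's original binary device can be simulated directly. This is a generic illustration of the technique, not one of the neutrosophic scrambling models used later in the article; the parameter values are arbitrary.

```python
import numpy as np

# Warner's model: with probability P the respondent truthfully answers
# "do you have trait A?"; with probability 1 - P, "do you NOT have trait A?".
# If lam is the probability of a "yes", then lam = P*pi + (1 - P)*(1 - pi),
# so the trait prevalence is pi = (lam - (1 - P)) / (2P - 1), P != 0.5.
rng = np.random.default_rng(3)
n, P, pi_true = 200_000, 0.7, 0.30

has_trait = rng.random(n) < pi_true
asks_direct = rng.random(n) < P
yes = np.where(asks_direct, has_trait, ~has_trait)

lam_hat = yes.mean()
pi_hat = (lam_hat - (1 - P)) / (2 * P - 1)
print(round(pi_hat, 3))
```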
This article presents a fresh perspective on mean estimation in the context of neutrosophic data, which incorporates uncertainty not only in the form of imprecision and vagueness but also indeterminacy. The remaining sections of the article are organized as follows:
Section 2 introduces the neutrosophic versions of the OLS-based and regression-based Hartley–Ross-type mean estimators. In
Section 3, a new class of Hartley–Ross-type neutrosophic robust regression estimators is proposed. The usefulness of these methods in sensitive research is discussed in
Section 4. A numerical example is provided in
Section 5 to demonstrate the effectiveness of the proposed estimators. Finally, in
Section 6, the article concludes by highlighting the potential of neutrosophic robust regression methods in addressing the challenges posed by the complex and uncertain nature of real-world data.
2. Adapted OLS Based Neutrosophic Hartley–Ross Type Estimators
Parametric regression methods are widely used in statistics, with ordinary least squares (OLS) being among the most common. OLS estimates the model's parameters by minimizing the sum of squared residuals, and is known for its mathematical simplicity and elegance. The Gauss–Markov theorem lists the assumptions that must be satisfied for OLS to be the most suitable estimator of the linear regression coefficients: when they hold, the OLS estimators are the best linear unbiased estimators (BLUE), having the smallest variance among all linear unbiased estimators. Readers interested in this topic can refer to the work of Al-Noor and Mohammad [20].
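As a quick reminder of what the OLS coefficient is, the simple-regression slope minimizing the residual sum of squares is S_xy / S_xx. The sketch below recovers a known slope from simulated data; the parameter values are arbitrary.

```python
import numpy as np

# OLS slope for simple linear regression y = a + b*x: b = S_xy / S_xx,
# the minimizer of the sum of squared residuals.
rng = np.random.default_rng(0)
x = rng.normal(10.0, 2.0, size=200)
y = 3.0 + 1.5 * x + rng.normal(0.0, 0.5, size=200)

b_ols = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)  # OLS regression coefficient
a_ols = y.mean() - b_ols * x.mean()                     # intercept
print(b_ols, a_ols)
```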
We consider a neutrosophic study variable defined on a finite population consisting of N identifiable units. To estimate its unknown population mean, a neutrosophic auxiliary variable is used. Additional information on the auxiliary variable may be available for the entire population, or only on a sample from it when no population-level information related to the study variable exists. By adapting Hartley and Ross [10], Singh and Singh [13], and Kadilar and Cekim [15], a class of neutrosophic Hartley–Ross-type sample mean estimators based on the OLS regression coefficient is constructed, each member being defined by its own pair of constants for j = 1, …, 8. These constants are built from the neutrosophic coefficient of kurtosis, the neutrosophic coefficient of variation, and the neutrosophic coefficient of correlation, together with the sample observations of the auxiliary variable, their sample mean, and the means of the corresponding transformed observations for j = 1, …, 8, respectively.
The biases of these estimators follow from the standard Hartley–Ross argument. Substituting unbiased components for the biased ones yields unbiased versions of the estimators, and the variance of each unbiased estimator is then obtained in the usual way, expressed through the neutrosophic variances of the study and auxiliary variables and the neutrosophic OLS regression coefficient. Note that all notation is used in neutrosophic (interval-valued) form, including the unit ratios of study to auxiliary observations; the estimators of this section employ only the neutrosophic OLS regression coefficient.
3. Proposed Robust Regression Based Neutrosophic Hartley–Ross Type Estimators
When a dataset contains outliers or anomalous data points, the efficiency of ordinary least squares (OLS) estimates can be compromised. The breakdown value of OLS fitting is 1/n, tending to 0% as the sample grows, indicating that it can be easily influenced by a single outlier, as pointed out by Hampel et al. [21] and Rousseeuw and Leroy [22]. According to Seber and Lee [23], the OLS method's susceptibility to outliers can be attributed to two main factors:
When using the squared residual to estimate the residual size, any residual with a greater magnitude will have a disproportionately larger effect on the overall size compared to the other residuals.
Using a conventional location measure, such as the arithmetic mean, which is not resistant to outliers, may result in a significant impact on the criterion due to a large squared value, which in turn can lead to a disproportionate effect on the regression results.
To reduce the influence of outliers on regression results, one may opt for alternative regression methods that are less sensitive or affected by outliers, such as robust regression. To gain a better understanding of these robust regression techniques, readers can consult the research conducted by Yu and Yao [
24].
Ordinary least squares (OLS) regression is a widely used method for estimating the parameters of a linear regression model. It works by minimizing the sum of squared residuals, which represent the difference between the observed and predicted values of the dependent variable. The objective is to identify the best-fit line that can explain the relationship between the independent and dependent variables while minimizing the error in the model. However, the OLS method relies on the assumption that the residuals are normally distributed and have constant variance, which is not always the case in real-world data.
OLS regression is highly sensitive to outliers, which are observations that significantly deviate from other data points. Outliers can occur naturally or due to measurement or data entry errors. The impact of outliers can be significant, as they can affect the estimates of regression coefficients and lead to inaccurate predictions. In contrast, robust regression is a technique specifically designed to handle outliers more effectively, and is therefore less sensitive to their presence in the data.
Robust regression is one class of robust statistical methods that aims to minimize the sum of the absolute residuals rather than the sum of the squared residuals. By focusing on the absolute values of the residuals, this approach is less susceptible to the influence of outliers, as extreme values do not disproportionately impact the results. Rather than assigning equal weight to all observations, robust regression assigns lower weights to outliers, which reduces their impact on the regression coefficient estimates. In cases where outliers are prevalent, robust regression is often a more appropriate approach than OLS regression, as it is less affected by extreme values.
OLS regression assumes that the variance of the residuals is constant across all levels of the independent variable, which is known as homoscedasticity. However, this assumption may not hold in real-world data, particularly when the dependent variable is not normally distributed. Biased estimates of the regression coefficients and unreliable predictions may result from this inconsistency. Another possible limitation of OLS regression is its sensitivity to outliers or unusual data points, which can impact the estimates and decrease efficiency. However, the weighted least squares method employed by robust regression can address this problem by assigning higher weights to observations with smaller variances and lower weights to those with larger variances. This technique allows for more precise estimation of the coefficients, even when outliers or influential points are present.
Robust regression outperforms OLS regression when dealing with non-normal data, which violates the normality assumption. OLS regression assumes that the dependent variable is normally distributed, which is not always true in real-world data. Consequently, the regression coefficients obtained from OLS regression may be biased, and the predictions may be inaccurate. Robust regression, on the other hand, does not assume normality and can handle non-normal data more effectively. We use four robust regression methods: least absolute deviations (LAD) for p = 1, Huber-M (H) for p = 2, Hampel-M (H-M) for p = 3, and Tukey-M (T-M) for p = 4. For further details, please refer to Zaman and Bulut's work [25,26].
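A minimal sketch of one such method for simple regression: the Huber-M slope fitted by iteratively reweighted least squares (IRLS) with a MAD-based robust scale. This illustrates the general technique only; it is not the authors' implementation nor the exact weight functions of [25,26].

```python
import numpy as np

def huber_slope(x, y, k=1.345, n_iter=50):
    """Huber-M simple-regression fit via IRLS: residuals larger than k
    robust-scale units get weight k/|u| instead of 1, limiting the
    influence of outliers on the fitted slope."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    b, a = 0.0, y.mean()                                 # crude starting fit
    for _ in range(n_iter):
        r = y - (a + b * x)                              # current residuals
        s = np.median(np.abs(r - np.median(r))) / 0.6745 # MAD scale estimate
        if s <= 0:
            s = 1.0
        u = np.abs(r) / s
        w = np.minimum(1.0, k / np.maximum(u, 1e-12))    # Huber weights
        xw = np.sum(w * x) / np.sum(w)
        yw = np.sum(w * y) / np.sum(w)
        b = np.sum(w * (x - xw) * (y - yw)) / np.sum(w * (x - xw) ** 2)
        a = yw - b * xw
    return b, a

# One wild point barely moves the robust slope, unlike OLS.
x = np.arange(20.0)
y = 2.0 * x + 1.0
y[-1] += 100.0                                           # inject an outlier
b, a = huber_slope(x, y)
print(b, a)
```

For comparison, the OLS slope on the same data is pulled far above 2 by the single outlier, while the Huber-M slope stays close to the true value.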
The proposed neutrosophic robust regression Hartley–Ross-type mean estimators replace the neutrosophic OLS regression coefficient in the estimators of Section 2 with the corresponding neutrosophic robust regression coefficient, and their variances follow in the same manner as in Section 2.
4. Neutrosophic Robust Estimation in Sensitive Research
Sensitive topics, such as abortion, xenophobia, tax evasion, drug use, alcoholism, gambling, reckless driving, and sexual behavior, are regarded as intrusive as they violate the privacy of the respondents. When conducting research on sensitive issues, direct inquiries on personal or stigmatizing matters can lead to respondents either refusing to respond or providing inaccurate information, leading to non-sampling errors that significantly undermine the data’s quality and subsequent analyses’ relevance. Survey statisticians have developed various strategies to encourage respondent participation while respecting their privacy to minimize such socially desirable biases. One way to improve the accuracy of responses on sensitive topics is to reduce interviewer influence. Self-administered questionnaires, computer-assisted telephone interviews, computer-assisted self-interviews, and web surveys are commonly employed for this purpose. To avoid socially desirable bias in data collection on sensitive topics due to non-sampling errors, one solution is to utilize indirect questioning methods rather than direct questions.
Chaudhuri and Christofides [
27] suggest various techniques for gathering sensitive information while circumventing direct questioning of survey participants. Among these methods is the randomized response technique, which was initially developed by Warner [
16] as a way to estimate the population mean of a sensitive variable in a study while minimizing interviewer influence. While originally designed for binary variables to assess the prevalence of stigmatizing attributes, it has since been adapted to evaluate sensitive quantitative variables concerning diverse aspects of life, such as personal income level (Barabesi et al. [
28]), extramarital relationships, induced abortions and unwanted pregnancies, tax evasion, weekly hours of undeclared work (Trappmann et al. [
29]), the number of cannabis cigarettes smoked (Cobo et al. [
30]), and the frequency of deviant sexual behaviors that students struggle to control (Perri et al. [
31]). Academic literature proposes multiple methods to safeguard the confidentiality of respondents when gathering data on sensitive variables. These techniques aim to modify responses in a way that conceals the actual values of the variables, using one or more random variables to alter them. The pioneering works in this area were by Greenberg et al. [
32], Eriksson [
33], and Pollock and Beck [
34]. Other researchers have since developed their own distortion techniques, including Eichhorn and Hayre [
35], Bar-Lev et al. [
36], and Diana and Perri [
37,
38]. These random variables must be statistically independent of the sensitive variable and of each other, and the researcher must possess a thorough understanding of their probability distributions.
The present study explores the use of two neutrosophic scrambling variables. To elicit the responses, each participant is instructed to generate one neutrosophic value from each of the two scrambling variables and to disclose only the neutrosophic scrambled value produced by a scrambling function that enables respondents to conceal their true sensitive value. The particular form of the scrambling function is established by the chosen scrambling technique, which may be one of the two neutrosophic models elaborated in the subsequent sections. It is assumed that the values generated by the respondent are not disclosed to anyone, ensuring the researcher's uncertainty about the true sensitive value and protecting respondent privacy. While individual sensitive values cannot be determined, accurate estimates of certain characteristics of the sensitive variable can still be obtained by taking a sample of n units and using the scrambled responses of all units in the sample.
This paper considers two neutrosophic versions of scrambling models: (i) the additive model (Pollock and Beck [34]) and (ii) the mixed model (Saha [39]). Using these scrambling models, an unbiased estimate of the unknown mean of the sensitive variable can be obtained from the scrambled values, together with the estimator's variance. As an example, assume that the additive model is used to distort the true responses and that the mean of the sensitive variable is to be estimated from a simple random sample without replacement (SRSWOR) of n units taken from the population. With this approach, the estimation process becomes straightforward. Let the scrambling variable have a known distribution, so that its mean and variance are known beforehand. Each selected respondent is directed to generate a number from this distribution using a computer or smartphone application, add it to their true value, and release the scrambled response while keeping the generated value confidential. The sample mean of the scrambled responses, corrected by the known scrambling mean, is an unbiased estimator of the unknown sensitive mean, where unbiasedness is assessed with respect to the scrambling device and can be expressed using the expectation operator E:
Using the notation introduced in previous sections, the sample mean of the scrambled responses, corrected by the known mean of the scrambling variable, is a design-unbiased estimator of the sensitive population mean, with a variance inflated by the known scrambling variance.
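Under the stated assumptions, the additive device is easy to simulate end to end. The Gamma population and the scrambling variance below are arbitrary choices for illustration only.

```python
import numpy as np

# Additive scrambling: the respondent reports z = y + s, with s drawn from
# a known distribution S (mean mu_s, variance sigma_s^2).  Because
# E[z] = E[y] + mu_s over the scrambling device, z_bar - mu_s is unbiased
# for the sensitive mean.
rng = np.random.default_rng(42)
y = rng.gamma(shape=4.0, scale=5.0, size=100_000)   # true sensitive values
mu_s, sigma_s = 0.0, 3.0                            # known to the researcher
z = y + rng.normal(mu_s, sigma_s, size=y.size)      # scrambled responses

estimate = z.mean() - mu_s
print(estimate, y.mean())
```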
The mixed model can also be treated using the same method. However, to maintain brevity, we have excluded the specifics from this explanation. Readers who want to know more can consult Shahzad et al. [
40].
Proposed Neutrosophic Hartley–Ross Type Estimators for Scrambled Responses
Building on the concepts mentioned above, we adapt the suggested class to handle cases where the target variable is sensitive and information is obtained using the scrambling methods described earlier. Let z denote the neutrosophic scrambled responses observed on a sample selected from the population under study using SRSWOR. We can then obtain the class of neutrosophic estimators with scrambled responses by replacing y with z in the estimators of the preceding sections.
To obtain the variance equations, we substitute the population parameters of y with the relevant population parameters of z in Equation (8). To maintain brevity, we abstain from presenting the altered formulas, as they can be effortlessly derived. The same approach can be used to extend all the estimators discussed in the preceding sections to the sensitive setting, which is explored in the following section.
5. Numerical Illustration
As the idea of neutrosophic Hartley–Ross-type estimators is relatively novel and no prior research is available on the topic to the authors’ understanding, a study is conducted to compare the variance performance of different neutrosophic estimators. Specifically, we compared the proposed neutrosophic robust regression Hartley–Ross-type estimators (
Section 3) and the adapted neutrosophic estimators (
Section 2) to determine which performed better. Typically, the best estimator in a set is taken to be the one that yields the lowest variance.
We conducted a numerical analysis using interval data with indeterminate values from the Islamabad Stock Market, specifically the United Bank Limited (UBL). The data consist of neutrosophic values that are uncertain and fall within a certain range. We chose UBL’s stock market share price data because the price values fluctuate within a range, with the recorded share price for the day potentially being the highest, lowest, or any value in between. The data were obtained from the publicly available website (
https://pkfinance.info, accessed on 14 September 2021), and approval was not needed.
Table 1 and
Table 2 present the neutrosophic features of the data, which were obtained using the share price index values for 2019 and 2020 denoted as
and
, respectively. It is worth mentioning that
N = 239 and
n = 30.
For the sensitive scenario described in
Section 4, two neutrosophic scrambling techniques are used to perturb the target variable values. These techniques, proposed by Shahzad et al. [40], involve two scrambling variables that are assumed to follow a normal distribution, with a neutrosophic mean of [0, 0] and a neutrosophic standard deviation taken as a fraction of the standard deviation of the auxiliary variable.
5.1. Simulation Study
For the simulation study, we generate a neutrosophic population of size 1000. Taking motivation from Aslam et al. [41] and Shahzad et al. [42], the neutrosophic study and auxiliary random variables are generated from a neutrosophic Gamma distribution combined with a neutrosophic standard normal component.
By adapting Shahzad et al. [42], samples of size 100 were chosen independently under the simple random sampling design, and for the hth sample the mean estimate of the study variable was calculated for both the adapted and the proposed estimators.
The variances were calculated as the average of the squared deviations of the h sample mean estimates from the population mean.
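The exact generating equations belong to the original display formulas; purely to illustrate the mechanics (an interval-valued population, repeated SRSWOR draws, and the empirical variance of a mean estimator), a hedged sketch with arbitrary parameter values is:

```python
import numpy as np

rng = np.random.default_rng(7)
N, n, H = 1000, 100, 2000        # population size, sample size, replications

# Interval-valued (neutrosophic) population built from a Gamma component:
# column 0 holds lower bounds, column 1 upper bounds.
g = rng.gamma(shape=2.0, scale=1.0, size=N)
x = np.column_stack([g, g + rng.uniform(0.0, 0.5, size=N)])
y = 2.0 + 1.2 * x + rng.normal(0.0, 0.3, size=(N, 1))

means = np.empty((H, 2))
for h in range(H):
    idx = rng.choice(N, size=n, replace=False)   # SRSWOR draw
    means[h] = y[idx].mean(axis=0)               # stand-in for a mean estimator

# Empirical variances (lower and upper bound) over the H replications
var_hat = np.mean((means - y.mean(axis=0)) ** 2, axis=0)
print(var_hat)
```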
The calculated variances and percentage relative efficiencies (PREs) are presented in Tables 3–14, where the PRE of a proposed estimator is the variance of the corresponding adapted estimator divided by the variance of the proposed estimator, multiplied by 100.
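The PRE reported in the tables can be computed as below; the two variances in the usage line are placeholder numbers, not values from Tables 3–14.

```python
def pre(var_adapted, var_proposed):
    """Percentage relative efficiency of the proposed estimator relative to
    the adapted one; values above 100 favor the proposed estimator."""
    return 100.0 * var_adapted / var_proposed

print(pre(0.052, 0.040))
```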
5.2. Remarks
We note the following points from the numerical investigation:
Table 3 and Table 4 present the results of the numerical study in terms of the variance and PRE. For the non-sensitive case, it is noteworthy that all of the estimators belonging to the proposed class perform better than the adapted estimators.
The results displayed in
Table 5,
Table 6,
Table 7 and
Table 8 also support the excellent performance of the estimators from the proposed class
when estimating scrambled responses. Regardless of the specific scrambling device utilized, all the estimators from the proposed class consistently outperform the adapted estimators.
The results of the simulation study in Table 9, Table 10, Table 11, Table 12, Table 13 and Table 14 show the same behavior, i.e., the superiority of the proposed class.
6. Conclusions
The point estimate in survey sampling has the drawback of fluctuating across samples due to sampling error, since it provides only a single value for the parameter under discussion. The neutrosophic approach, introduced by Florentin Smarandache, offers a valuable alternative for estimating parameters in sampling theory: it provides interval estimates that have a high probability of containing the parameter. The neutrosophic technique, which extends the classical approach, is therefore employed to handle data that are ambiguous, indeterminate, or uncertain. In this paper, a novel class of Hartley–Ross-type estimators is proposed for estimating the population mean using neutrosophic robust regression. The approach builds on recent advancements in neutrosophic statistics and is applied to both standard and sensitive settings, where the target variable is protected using scrambling techniques. The performance of the proposed estimators is evaluated using the UBL data and a simulation study, and they are shown to outperform the adapted estimators in both settings. In future work, this study can be extended to other sampling designs.