1. Introduction
The provision of labeled datasets in the most efficient possible way is of paramount importance to enhance the categorization capacities of machine learning methods, particularly supervised learning [1,2]. Data labeling systems and applications are enabling tools to steer annotation practices through the cooperation of a large number of volunteers conducting a set of minor tasks. Herein, the volunteers are called individuals, contributors, users, crowds, and/or workers, while the microtasks include common labeling activities such as identifying and marking specific facets of images. For instance, as illustrated in the literature, user experience is essential to designing efficient and adaptable interactive industrial Internet of Things systems [3]. Using the wisdom and cognitive ability of an undefined network of individuals to respond to an open call is also denoted as crowdsourcing [4,5,6]. Generally, crowdsourced data labeling refers to the process of assigning tags to raw data (such as images or text) by leveraging the contributions of a large group of people, often through online platforms.
The current paper explores a data labeling system that facilitates the crowdsourcing practice of annotating large-scale datasets from the perspectives of payment mechanism design and reporting. Together with the established aggregating metrics, the modifications adopted in the payment mechanism provide a basis for the system's development and usability to handle a vast quantity of data. The proposed techniques, recommendations, and findings are the result of implementing a real data labeling system denoted POBEL, developed by a large technological solution provider called FANAP Co.
In this paper, we address two critical components in the design of an efficient data labeling system: an innovative payment mechanism and a robust configuration of output results. The payment mechanism is designed to incentivize user participation and maintain the integrity of the labeling process by incorporating a skip-based golden-oriented function. This function not only balances user penalties but also mitigates spam activities. On the other hand, the configuration of output results is managed through a comprehensive reporting framework. This framework measures the aggregated results and accuracy levels, ensuring the reliability of the annotated outputs. By focusing on these two aspects, our approach aims to enhance the overall efficiency and effectiveness of crowdsourced data labeling systems.
In a nutshell, labeling is the act of detecting particular characteristics of raw data and associating them with factual points to address their correct functionality. A typical example is the selection of correct images that have a certain feature (e.g., being an animal) among a set of given data. Pinpointing the sense of sentences constitutes another feature of the labeling activity in which positive, negative, or neutral tones are detected. Procuring labeled clean data about any concept is vital for the training and pattern discovery qualities of machine learning algorithms. At the same time, it is infeasible to assign the responsibility of labeling a high volume of data to a single individual or a few individuals [7]. Thus, crowdsourcing-based data labeling applications are a workable solution designed to provide a platform for feeding the data required by the underlying algorithms [8]. One of the prevalent topics is how to motivate the crowd to participate in the labeling practice using financial incentives. The crowd should be paid based on their performance in issuing high-quality labels, which requires a systematic payment mechanism. Such a mechanism is the central element of any data labeling system and thereby conditions the behavior of other elements such as the distribution strategy of items and the aggregation of the results' configuration.
A well-known research stream related to the management of the crowd-workers' payment mechanism is the utilization of golden items within the pool of data that is going to be annotated. Golden items are a pre-specified batch of data whose correct labels are known by the system's admin but unrecognizable to ordinary users. By mixing golden items into the pool of data, the performance of the users can be measured by comparing the quality of their answers to the golden items [9]. The current study applies the golden approach for controlling the payment mechanism of the proposed data labeling system. In this regard, the closest study to ours relates to the skip-based approach of [10]. The authors proved that their approach was the most reliable one for satisfying the no-free-lunch axiom. This axiom hedges against paying more credit than the lowest possible one to the workers that assign incorrect labels to the golden data.
Given a number of golden items within the main pool of data, ref. [10] designed a multiplicative credit function to set the workers' payments between the pre-defined minimum and maximum thresholds. In particular, the scores of users increased exponentially based on their correct responses to the golden data. The rate of increase was the inverse of the significance level. Since the significance level was between zero and one, the rate easily became greater than one. On the other hand, users' credit plummeted to the minimum threshold when assigning one wrong label to the golden items. Users could also skip the items and keep their score intact when doubtful about the correct labels of the items. This paper customizes the function proposed by [10], providing practical solutions to resolve the following concerns of their study, each of which leads to a research question addressed in this paper:
First Concern (C1): Shifting the credit of users to the lowest level due to the submission of a single error seems an overly strict rule. That is, an individual who has submitted 99 correct labels to the golden items and now reserves USD X in his/her wallet can lose the whole credit by submitting one wrong label. Under such a rigorous condition, the workers of a new crowdsourcing business may consider the scoring method unfair and halt their contributions.
- Research Question 1 (RQ1): How can the credit function of [10] be modified to reasonably alleviate the rigorous condition applied to the provision of incorrect answers?
Second Concern (C2): Ref. [10] did not provide information about whether all golden items are supposed to be purely positive in the True/False label type. Consider an image dataset of celebrities in which individuals are asked about the conformity of a specific photo with a given celebrity's name. What would be the possible consequence(s) if all golden items were assigned True-type labels (e.g., the correct label for all the golden items was True)? It would then be possible for a spammer to assign the True label to all data and collect the whole credit, since the penalty is only activated when False-type labels are assigned to golden items. In this way, the spammer would not be penalized when submitting True-type labels to non-golden data that do not belong to a given celebrity.
- Research Question 2 (RQ2): What changes related to single-type golden data must be made to avoid the cheating action of spammers?
Third Concern (C3): Consider now the settings of the function's parameters. The proposed credit function could become inappropriate over large-scale datasets if workable operations are not adopted to tune the underlying parameters. By neglecting parameter tuning, a user who contributes by labeling a small percentage of the entire large-scale dataset would obtain a negligible credit of less than one unit. Since the formula grows exponentially, the increasing trend of the credit only becomes tangible after a significant number of dataset items have been labeled. Therefore, the user may doubt the trustworthiness of the proposed data labeling system.
- Research Question 3 (RQ3): How can the credit function parameters of [10] be tuned?
Fourth Concern (C4): The way golden questions are distributed into the pool of ordinary data may be influential on the viewpoints of the users. For instance, suppose that 10 golden items out of 100 dataset items are shown in a row to the users. Then, a user may think that the credit function does not work properly as his or her score stops increasing when labeling the remaining 90 items. This constitutes a potential drawback derived from the distribution of golden data.
- Research Question 4 (RQ4): What kind of practical yet easily implementable distribution mechanism(s) can be employed to enhance the efficacy of the associated credit function?
Fifth Concern (C5): The implementation of the aforementioned framework within a real data labeling system requires a specific reporting format to configure the outputs. In this regard, the corresponding payment and distribution mechanisms must deliver clean annotated data to the customer. These features have rarely been studied and need to be discussed.
- Research Question 5 (RQ5): What metrics should be considered in the reporting framework of the proposed data labeling system?
Addressing and answering the above research questions constitutes the main contribution of the current manuscript relative to previous studies. In particular, the development stage of the proposed data labeling system has been specifically designed to address the above concerns. The solutions applied resort to best practices together with heuristic and workable approaches.
The rest of the paper proceeds as follows: Section 2 reviews the literature; Section 3 explores the working structure of the proposed data labeling system and discusses the payment mechanism; Section 4 presents different sensitivity analyses; Section 5 configures a reporting template for aggregating the outputs; and Section 6 concludes and provides future research recommendations.
2. Literature Review
This section reviews the recent applications of data labeling systems, different types of payment mechanisms, as well as prospective quality control and aggregating metrics.
Research involving the design of crowdsourcing-based applications to handle the various requirements pertaining to data labeling problems has consistently grown since 2006 [11,12]. Previous studies have shed light on the system-based implementation of data labeling, including speech recognition, environmental assessment, text scanning, image detection, and sentiment analysis. For instance, ref. [13] designed a web-based crowdsourcing application to obtain a large-scale speech emotion recognition dataset for easing the learning of speaker-adaptive systems. The application allowed users to select a specific emotion, e.g., fear, per random phrase and record their voice to convey the same sense. Users also had the chance to preview the recorded voice and modify it before submission. Despite considering a convolutional neural network for transfer learning, the authors did not describe how users' annotations were validated in the proposed crowdsourcing application.
Ref. [14] used crowdsourcing to assess the post-disaster damage level of constructions by involving citizens. The microtask was to complete a questionnaire composed of a set of simple items. The predefined decision rules together with the answers obtained were used to assess the degree of raw damage. The final degree of damage was estimated through statistical inference and reported to the crisis management office to adopt the required actions and dispatch rescue forces. Ref. [15] developed a crowdsourcing web application to scan the key terms of scientific papers. The application eased the procedure of finding papers related to keywords, ranking the papers retrieved based on their impact factor while screening, annotating, and classifying the text. The data labeled were aggregated through the majority voting approach. The authors enhanced their previous design by improving the usability and efficiency of the application [16].
Recent developments focus on the strategies designed to achieve and maintain a critical mass of motivated users [17], which has led to the introduction of motivational tactics that borrow their main qualities from games, a process known as gamification [18].
There is also a rising number of studies associated with the analysis of the credit function and payment mechanism in terms of the users' performance. Ref. [19] introduced two approaches, namely, majority decision and control group, for evaluating the work submitted by users. In the majority decision approach, all the workers involved were paid, and the aggregation was made based on the most frequent response among the annotations submitted. The control group approach implemented a more rigorous method by delegating a task to a specific worker and subsequently double-checking the results submitted by a group of users. If the majority of the control group confirmed the result of the initial user, he/she received the bonus. This method became more applicable when the task of the individuals within the group was cheaper and easier than that of the initial user. For instance, the initial user was supposed to write an abstract about a particular topic while the control group's users assessed the quality of the work submitted by scanning the text. The analysis performed by the authors showed that both approaches provide a significant level of confidence for detecting untrustworthy annotators. However, the majority decision worked better with low-price tasks, whereas the control group outperformed its counterpart in the case of high-price ones.
Ref. [20] defined a Nash equilibrium within the incentive mechanism of the crowdsourcing setting in order to minimize total payment. Ref. [21] suggested a dynamic distribution of the questions to minimize the active labeling duration of spammers or careless contributors by discovering correlated performance patterns. Their results illustrated the superiority of the dynamic approach relative to the static one in terms of rework rate reduction. Ref. [22] segmented a well-defined crowdsourcing quality control taxonomy into its model, assessment, and assurance components. The authors showed that ground truth data, inclusive of golden or control questions, could fully measure the performance of users. Figuring out the malicious behavior of contributors in terms of the responses submitted to online surveys was also the research topic of [23]. These authors developed an approach to evaluate the maliciousness of the contributors and grouped the spammers into five categories ranging from ineligible workers to smart deceivers. Ref. [24] extracted two strategies from previous studies to enhance crowdsourced labeled data at the task design stage and after the data collection stage. In the former stage, a real-time feedback system together with a shared workflow between workers and requesters, periodical checkpoints, and a golden-led payment mechanism were all utilized to increase the quality of outputs [10,25,26]. In the latter stage, trust models together with the imposition of replication rules were employed to sift through spamming activities.
In the current study, we incorporate quality control into the task design by introducing practical solutions into the golden-led payment mechanism of [10]. The introduction of quality control following data collection fosters the application of our approach to real crowdsourcing data labeling systems. In particular, we impose a replication constraint and define reliability metrics across different aggregating report scenarios. As illustrated in Table 1, the contribution of the current paper to the literature consists in the simultaneous application of practical data labeling-based solutions. This is carried out by tuning the payment function, considering dual-type golden data, incorporating data distribution strategies, and configuring final reports into the data labeling system design practice.
In addition to its current implementation, a variety of information-retrieval scenarios facing reliability frictions arise as potential business applications of POBEL. For instance, its features are particularly relevant in the initial development stages of firms, which require processing large amounts of user data and dealing with potential drawbacks regarding the quality of the information collected [27,28]. A similar business application would follow from the collection of reliable data from firm employees and managers to be processed through enterprise resource planning systems [29,30].
3. Payment Mechanism
This section studies the features, challenges, and solutions defined to set out an applied credit function when designing a real data labeling system. Prior to describing the mathematical formula, we analyze the functionality of the payment mechanism within the working process of the proposed system. We then describe how the basic version of the credit function evolves into a new and workable one while responding to questions RQ1 to RQ4.
3.1. Position of the Payment Mechanism in the Working Process
Figure 1 describes the general process of the labeling system, emphasizing the position of its credit function. Note that all the settings described are adjustable in the admin panel. Like any other system, the labeling practice starts by receiving the input data from customers and involving the crowd workers. The system must reward workers fairly and deliver accurately labeled data to customers. The design proposed involves several consecutive steps. Users enter the system through an authentication method and select their desirable datasets from among those available. They also determine their contribution level prior to starting, a procedure known as target setting. Users can then start labeling, i.e., submitting the appropriate answer to each item. Data are displayed until the number of labels assigned to each item reaches the pre-defined replication count.
As users submit their answers, credit is updated by inputting the correct and/or incorrect responses to golden items into the proposed function. When a user completes labeling a target, he/she is allowed to collect the corresponding credit by transferring it to a virtual wallet. The last step requires aggregating the data labeled and reporting significant statistics to the system administrator. The payment mechanism is at the core of the labeling system and has a two-way communication relation with the preceding and succeeding steps. That is, a breakdown in communication would result from a failure of the administrator to embed the practical features of the credit function into the body of the labeling system. Moreover, the corresponding function or payment mechanism is crucial for feeding the labeling aggregation and reporting results with the required information.
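A minimal sketch of the dispatch rule described above, under assumed (hypothetical) names: an item keeps being shown to new users only while it has collected fewer labels than the pre-defined replication count.

```python
from collections import defaultdict

REPLICATION = 3                    # labels required per item (set in the admin panel)
labels = defaultdict(list)         # item_id -> list of (user_id, answer) pairs

def next_item_for(user_id, items):
    """Return the next item this user may label, or None when the target is exhausted."""
    for item_id in items:
        answered_by_user = any(uid == user_id for uid, _ in labels[item_id])
        if not answered_by_user and len(labels[item_id]) < REPLICATION:
            return item_id
    return None

def submit(user_id, item_id, answer):
    """Record one answer; the item stops being dispatched once REPLICATION is reached."""
    labels[item_id].append((user_id, answer))
```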
3.2. Credit Function Payment
We review the original skip-based formula of [10] before describing the modifications proposed to the credit function of the labeling system. This formula is used to define the basic version of our credit function.
Table 2 describes the notation used together with the corresponding definitions and mathematical domains.
The original formulation of the payment function obeys Relation (1), in which the values of the incorrect, skip, and correct coefficients are equal to 0, 1, and 1/T, respectively, with T denoting the shape parameter. Note that the type of golden questions is omitted from the computation, which results in only three coefficients: 0, 1, and 1/T. An underlying assumption is that the counts of correct, skipped, and incorrect answers add up to the number of golden questions G, which imposes a single label per question. The extensive form of the formula is completed by raising each coefficient to the corresponding count of correct (C), skipped (S), and incorrect (W) answers, that is, credit = μmax · T^G · (1/T)^C · 1^S · 0^W.
The reasoning behind the choice of the base value μmax · T^G is to give the maximum credit to an individual assigning correct labels to all the golden questions. If a contributor responds to all the golden questions correctly, the formula becomes μmax · T^G · (1/T)^G, which yields μmax. Finally, if the respondent skips a choice, a coefficient of 1 is entered into the formula, preserving the credit value intact.
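To make the mechanics concrete, the following minimal sketch implements Relation (1) as reconstructed above (notation assumed here: T shape parameter, n_golden golden questions, mu_max/mu_min payment thresholds); it reproduces the 10-question example discussed in Section 3.2.3, where T = 0.5, 3 items are golden, and the maximum credit equals 80 units.

```python
def base_credit(correct, skipped, wrong, T, n_golden, mu_max, mu_min=0.0):
    """Skip-based multiplicative credit of Relation (1), as reconstructed here."""
    assert correct + skipped + wrong == n_golden   # one label per golden question
    credit = mu_max * T**n_golden * (1 / T)**correct * 1**skipped * 0**wrong
    return max(credit, mu_min)

# 1, 2, and 3 correct answers yield 20, 40, and 80 units, while a single wrong
# answer collapses the credit to the minimum threshold (the behavior behind C1).
print(base_credit(1, 2, 0, T=0.5, n_golden=3, mu_max=80))   # 20.0
print(base_credit(3, 0, 0, T=0.5, n_golden=3, mu_max=80))   # 80.0
print(base_credit(2, 0, 1, T=0.5, n_golden=3, mu_max=80))   # 0.0
```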
3.2.1. Solution to RQ1 (Preliminary)
If a contributor assigns the wrong label to a single golden question, a zero value is introduced in the formula, resulting in a credit equivalent to the minimum threshold μmin. This is, indeed, the most controversial feature of the original formula, leading to C1. From the perspective of a contributor, it would be discouraging to contribute to a labeling system that returns the minimum credit due to a single error. We provide a milder condition where the default value of 0 is substituted by an inverse function of the correct coefficient 1/T, i.e., a positive power T^x, where x is derived using sensitivity analysis. This modification resolves RQ1, though the actual solution will be completed by taking the descriptions of Section 4 into account.
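A sketch of this modification under the same assumed notation: the incorrect coefficient 0 is replaced by T^x, so each wrong answer shrinks the credit multiplicatively instead of collapsing it to the minimum threshold.

```python
def adjusted_credit(correct, skipped, wrong, T, n_golden, mu_max, x, mu_min=0.0):
    """Credit function with the softened penalty coefficient T**x instead of 0 (RQ1).

    Skipped answers multiply the credit by 1 and therefore drop out of the product.
    """
    credit = mu_max * T**n_golden * (1 / T)**correct * (T**x)**wrong
    return max(credit, mu_min)

# With 100 correct and 25 wrong golden answers out of 125 (the per-target setting
# used in Section 4), x = 1 leaves roughly 2225 units instead of dropping to zero.
print(round(adjusted_credit(100, 0, 25, T=0.97, n_golden=125, mu_max=10_204, x=1), 2))
```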
3.2.2. Solution to RQ2
A subsequent problem is associated with the way golden questions are assigned to the dataset items. For instance, consider Figure 2, which illustrates a typical labeling case. There are nine pictures per sheet and users must determine whether they belong to the celebrity named (✓) or not (×). The Report and Skip (Go to next) options are also available alternatives. Assume that the administrator of the labeling system sets a specific number of photos as golden items to evaluate the accuracy of the user. In this figure, the star signs attached to some images are symbols of golden items known to the administrator but invisible to ordinary users. Focus now on C2. Is it important to define a particular type of golden item? What is the practical consequence of setting all golden items using either True- or False-type answers?
At first glance, one may define only True-type golden items since the consistency of a photo with the given celebrity's name is more important than detecting inconsistent items. In other words, consider the question "Are the images related to Charlie Chaplin (Actor)?". Assigning True-type labels to the photos that belong to this actor is more important than attributing False-type labels to the ones which are not Chaplin. In fact, when an image is not related to the celebrity addressed, it may belong to any other individual whose identity is irrelevant to the machine learning practice being evaluated. By resorting to such reasoning, however, the administrator would inadvertently validate the actions of spammers. That is, a spammer could simply submit a True sign (✓) for all the questions without even considering their content. Since the golden items correspond to True-type responses and wrong answers to the rest of the images do not affect the payment, a spammer would collect the whole credit while labeling the non-golden items wrongly. To cope with the fraudulent actions of spammers, one straightforward solution involves incorporating golden items with False-type answers. To clearly illustrate the idea, golden items with True- and False-type answers are denoted as positive and negative ones, respectively.
In Figure 2a, three golden images (marked with yellow stars) require submitting a True sign. In Figure 2b, a negative-type golden image is also defined, namely, the second image in the second row displaying a red star. In this case, the correct answer implies choosing the False (×) sign. Thus, the whole score cannot be obtained by selecting only True icons. We transform the original formulation into Relation (2) by distinguishing the type of golden items and introducing a type indicator parameter. This expression doubles the number of coefficients relative to the original function, distinguishing between the bonuses assigned to the positive and negative golden items. We allow for different values of the coefficients assigned to the negative golden items. For instance, the administrator could increase the penalty rate of users who submit incorrect answers to negative items and allow it to approach zero, imposing a severe loss on the credit function.
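The following sketch illustrates one possible reading of Relation (2), assuming (for illustration only) that each golden type carries its own penalty exponent so that wrong answers to negative golden items can be punished more severely; the exact coefficient structure is a design choice of the administrator.

```python
def dual_type_credit(c_pos, w_pos, c_neg, w_neg, T, n_golden, mu_max,
                     x_pos=1.0, x_neg=10.0, mu_min=0.0):
    """Credit with type-dependent coefficients for positive/negative golden items.

    c_pos/c_neg: correct answers to positive/negative golden items;
    w_pos/w_neg: wrong answers; skips multiply by 1 and are omitted.
    """
    credit = (mu_max * T**n_golden
              * (1 / T)**(c_pos + c_neg)                 # bonus for correct answers
              * (T**x_pos)**w_pos * (T**x_neg)**w_neg)   # type-dependent penalties
    return max(credit, mu_min)

# A spammer who answers True everywhere gets every negative golden item wrong, so
# the harsher exponent x_neg cuts the credit more sharply than a mistake on a
# positive golden item would.
print(dual_type_credit(c_pos=3, w_pos=0, c_neg=0, w_neg=1,
                       T=0.97, n_golden=4, mu_max=100))
```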
3.2.3. Solution to RQ3 (Preliminary)
Addressing the third concern C3 involves tuning the shape parameter (T), particularly when facing large-scale datasets. Table 3 displays the trends of the credit values of small- and large-scale instances under different configurations. The data are sorted in terms of correctly responding to a single, half, and the whole number of golden items, absent of any incorrect label (W = 0). Consider the example of [10], who set T = 0.5, for a total of 10 questions (N = 10) where 30% are golden (G = 3). Assume now that the maximum and minimum credits equal 80 (μmax = 80) and 0 (μmin = 0) units, respectively. Within this framework, the credit obtained by users would equal 20, 40, and 80 after answering 1, 2, and 3 golden questions correctly, respectively. Next, consider a larger dataset with 1000 questions. Applying the same proportions of the previous example, the parameters would be given by N = 1000, G = 300, and μmax = 8000. However, selecting a proportional shape parameter would violate the maximum threshold value of 1.
By keeping the shape parameter value unchanged (e.g., T = 0.5), the income obtained by users when correctly submitting 1 and 150 golden items is negligible. There is also a substantial income jump when answering the last golden question correctly. The resulting credit distribution would lessen the reliability of the system from the perspective of users. Most individuals would dismiss the slow credit increments obtained despite responding correctly to a considerable number of questions. An increase in the shape parameter from 0.5 to 0.95 would still deliver credit values that remain insufficient to stimulate users. The output of the credit function becomes more meaningful as T rises to 0.98. After this value, even marginal decimal increments (e.g., 0.985) would significantly increase the outputs. The shape parameter of the formula requires tuning via sensitivity analysis to yield a workable and encouraging system.
However, as the size of the dataset and the number of golden items increase, sensitivity analysis is insufficient to manage the payments in a real labeling system. Let N = 1000, G = 300, T = 0.985, and μmax = 8000. Consider the case with 1000 potential contributors. Assume now that users label a single question and leave the system after collecting the credit. How much will the system pay to this type of user? Any of these users facing a golden item by chance will receive 87.2 units when answering correctly. The existence of 300 golden items implies that the total maximum payment equals 26,160, which exceeds the budget assigned to the labeling practice of such a dataset (e.g., the 8000 units defined above).
This drawback follows from the fact that the formula does not distribute the total budget linearly among the golden items answered correctly or account for the number of contributors. In addition, the formula does not incorporate a mechanism to avoid violating the budget constraint when multiple individuals complete the labeling of all dataset items. It works well when users label the complete set of questions individually, but this requirement is not feasible with large-scale datasets. The output of the credit function would become inconsistent with the pre-defined budget if the contributors left the system with incomplete labels (i.e., whenever the items labeled are less than N). One solution is to prevent users from receiving any credit until completing all the questions. However, this may be considered an unfair obligation since users may not have enough time to wrap up labeling for large values of N.
Administrators often annotate each item of the dataset with more than two users and aggregate the results to infer the correct labels. At the same time, finding users willing to label the items of a large-scale dataset is far from easy. The proposed simple yet workable solution is to break down the dataset into smaller groups (called targets) and prevent users from collecting their credit prior to completing a target. For instance, assume that administrators require three labels per item. In the example with N = 1000, a target of size 100 can be defined, requiring 30 users, each labeling a selected target, instead of asking three users to annotate 1000 labels each.
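The target arithmetic, as a quick check (N = 1000 items, three labels per item, targets of 100 items):

```python
N, replication, target_size = 1000, 3, 100

n_targets = N // target_size                 # 10 targets of 100 items each
contributors_needed = n_targets * replication
print(contributors_needed)                   # 30 users, each completing a single target,
                                             # instead of 3 users annotating 1000 items each
```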
We conclude by noting that the solution to RQ3 will be completed through the analyses performed in Section 4.
3.2.4. Solution to RQ4
The distribution of golden items among ordinary data impacts the output of the payment function and the efficacy of the proposed labeling system. Such a distribution should guarantee the inclusion of a pre-defined number of golden items within a target, since any additional ones would require overspending the budget of the system. In addition, the order in which golden items are displayed should not be predictable, so as to counteract the actions of potential spammers. The design of an applied distribution mechanism requires choosing between a fixed count of questions for each target and a variable one. In the fixed strategy, the number of images displayed to a user equals the target size. In the variable setting, this number is unknown and images are displayed as long as the user submits True/False answers until the target size is reached. If the administrator follows the fixed strategy, a target size of 100 defines the number of questions shown to the user. The target is terminated whenever he/she submits a True/False/Skip answer to the 100th question. The same target in the variable setting will not be terminated until the user submits the 100th True/False answer. In this case, Skip answers are not counted, and questions are displayed until the sum of True and/or False annotations reaches the pre-defined size of the target. To cope with both strategies, straightforward but effective approaches are presented below.
Consider the fixed strategy. A division rate for decoupling the questions of each target can be defined as follows: a target of size 40 (including 30 ordinary and 10 golden items) with a division rate of 0.5 is analogous to decomposing the questions into two groups with 20 items each. The contribution of the golden items to each group can be determined through a Bernoulli trial, for instance, mapping probabilities of 0.2 and 0.8 onto the first and second groups, respectively. The resultant distribution assigns 2 golden items to the first group of 20 questions and 8 golden items to the second one. A final stage of this strategy regards the specification of negative golden questions per target. A percentage of negative golden items can be assigned using a similar intuition as in the general case, e.g., a percentage of 0.2 out of the total would deliver two negative golden items and eight positive ones.
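A minimal sketch of the fixed strategy with the numbers used above (a 40-item target with 10 golden items, a division rate of 0.5, Bernoulli probabilities of 0.2/0.8 for the two groups, and a 0.2 share of negative golden items); the in-group positions are randomized so that users cannot predict where golden items appear.

```python
import random

def distribute_golden(target_size=40, n_golden=10, division_rate=0.5,
                      p_first_group=0.2, negative_share=0.2, seed=0):
    """Fixed-strategy placement of golden items into two question groups (a sketch)."""
    rng = random.Random(seed)
    group_size = int(target_size * division_rate)      # 20 questions per group

    # Bernoulli trial per golden item: on average 2 land in group 1 and 8 in group 2.
    in_first = sum(rng.random() < p_first_group for _ in range(n_golden))
    in_second = n_golden - in_first

    n_negative = int(n_golden * negative_share)        # 2 negative and 8 positive golden items

    # Randomize the golden positions inside each group of questions.
    placement = {
        "group_1": sorted(rng.sample(range(group_size), in_first)),
        "group_2": sorted(rng.sample(range(group_size), in_second)),
    }
    return placement, n_negative

print(distribute_golden())
```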
The variable strategy does not limit the number of items shown prior to satisfying the pre-defined target size through True/False labels. This strategy can be implemented by considering a constant number of golden items per sheet (e.g., two golden ones out of the nine images available per webpage). However, such an approach would be vulnerable to budget deficiency since it does not restrict the number of golden items, and the credit of a user may go beyond μmax. If the maximum achievable income is exceeded, the formula can be updated using the pseudo-code described in Relation (3). This rule halts the increasing trend of the function whenever the credit surpasses μmax. It does so by omitting the coefficient of correct answers. Instead, the current credit (equal to μmax) of a contributor would be reduced by taking any potential incorrect answers to the golden questions into account.
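A sketch of the capping rule as we read Relation (3): once the accumulated credit has reached the ceiling, further correct answers no longer multiply it, while wrong answers to golden items are still penalized.

```python
def update_credit(current, correct, T, x, mu_max):
    """Single credit update under the variable strategy (Relation (3), sketched).

    `correct` refers to the answer given to a golden item; skipped items leave
    the credit untouched and therefore never reach this function.
    """
    if correct:
        # Halt the increasing trend once the maximum achievable income is reached.
        return current if current >= mu_max else min(current * (1 / T), mu_max)
    # Incorrect answers reduce the credit even after the ceiling has been hit.
    return current * T**x
```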
We ensure the random distribution of golden items through division rates, Bernoulli trials, and equal chance of display in the subgroups defined when implementing a fixed strategy. In the variable strategy, randomness is guaranteed by considering a constant number of golden items with randomized placement per displayed sheet. These methods prevent predictability and align with budget constraints by updating the credit formula dynamically.
4. Sensitivity Analysis (Complementary Solutions to RQ1 and RQ3)
We complete the answers to RQ1 and RQ3 by performing a numerical sensitivity analysis of the adjusted credit function. The case-dependent nature of the proposed function prevents us from prescribing a specific formula to derive the corresponding parameters. Instead, we develop a workable procedure to illustrate how the underlying parameters can be tuned. An example is presented to highlight the case-dependent structure of the problem. Further, a sensitivity analysis is performed to provide insights about how to identify the influence of the shape parameter on the function. Capturing such an influence allows us to discuss in detail the penalty coefficient. In this respect, the selection of the penalty coefficient depends on the decision of the system administrator regarding whether or not to choose a strict policy.
We utilize a dataset consisting of 245,000 images of celebrities, 25% of which are golden. To aggregate the results, three labels are considered for each image. Thus, the dataset requires 735,000 labels from contributors to be completed, including 183,750 golden labels. The total budget equals 15,000,000 units. The overall setting therefore involves 735,000 required labels (183,750 of them golden), a replication of three labels per item, and a total budget of 15,000,000 units. As discussed earlier, the formula works accurately when a single user labels the complete set of N items. Clearly, this requirement is not applicable as a prerequisite to receiving the earnings. We therefore segment the dataset into labeling targets of size 500. The number of golden items and maximum payment defined within each target equal 125 and 10,204, respectively.
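The setup arithmetic, as a quick check (the target size of 500 follows from the 25% golden share and the 125 golden items per target):

```python
n_images, golden_share, replication = 245_000, 0.25, 3
total_budget, target_size = 15_000_000, 500

total_labels = n_images * replication                  # 735,000 required labels
golden_labels = int(total_labels * golden_share)       # 183,750 golden labels
n_targets = total_labels // target_size                # 1,470 targets
golden_per_target = int(target_size * golden_share)    # 125 golden items per target
budget_per_target = round(total_budget / n_targets)    # ~10,204 units per target
print(total_labels, golden_labels, golden_per_target, budget_per_target)
```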
To assess the behavior of the credit function, we consider the following key assumptions. First, incorrect labels on golden items are omitted when deriving the credit distribution function. Users are assumed to either submit the correct answer or skip the question. Second, extreme values of the shape parameter are used to identify variations in the shape of the credit function and determine the value of the corresponding coefficients. Finally, to detect output trends under different penalty rates, all labels are initially assumed to be correct, and then the number of incorrect answers is gradually increased.
Figure 3 illustrates the values taken by the credit function for different levels of the shape parameter as the number of correct answers to the golden items increases. To efficiently assess the behavior of the function, the key assumption is to ignore incorrect labels to the golden items. That is, users either submit the correct answer or skip the questions. For example, a value of 100 on the horizontal axis indicates that out of the 125 golden items available, 100 labels have been correctly submitted while the other 25 questions were skipped. This figure intuitively describes the influence of T on the distribution shape of the credit function. An inappropriate tuning of T flattens the increasing trend of the credit function, conditioning the subsequent behavior of users. For instance, the smallest and largest values of T considered represent extreme cases illustrating this feature. At the smallest value, the outputs of the credit function are approximately zero even if 123 out of 125 golden questions have been correctly answered. Note the considerable jump that occurs when the user correctly labels the last two golden items.
When T = 0.999, the initial output equals 9013, which constitutes a significant percentage of the maximum income, i.e., 10,204. The increment of the function for the remaining items would be considerably slow. Such situations would undermine the trust of users since they could not easily observe their credit progression. Although the final output in all the scenarios proposed equals the maximum income of 10,204, the credit distribution over the count of answers submitted is a crucial incentive for attracting new users. Decision makers should determine the value of T that fits better with their strategy for attracting contributors via trial and error. In this case, T = 0.97 provides a reasonable distribution of the credit obtained. Answering correctly to 75 golden items and skipping the remaining 50 would lead to 9706, 2225, 462, 52, and a negligible number of units for T = 0.999, 0.97, 0.94, 0.90, and the smallest shape parameter considered, respectively.
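A sketch reproducing the credits quoted above with the adjusted function (per-target parameters: 125 golden items, a 10,204-unit ceiling, 75 correct answers, 50 skips); the shape-parameter values below are the ones implied by those credits.

```python
def credit(correct, wrong, T, n_golden=125, mu_max=10_204, x=1):
    return mu_max * T**n_golden * (1 / T)**correct * (T**x)**wrong

for T in (0.999, 0.97, 0.94, 0.90):
    print(T, round(credit(75, 0, T), 1))
# close to the 9706, 2225, 462, and 52 units quoted in the text
```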
Table 4 describes the distribution of the total budget across a set of reference points for different values of T. The metric represented defines the ratio of the credit received by a user after submitting a given number of correct answers divided by the total budget available per target (10,204 units). For instance, when T = 0.94 and 75 golden items are correctly labeled, we have 462/10,204, leading to the value of 4.53%. The unfair distributions at the two extreme values of T are clearly observable in this table, with users being paid nothing (0) at the smallest value and a large percentage of the budget (90.48%) at T = 0.999 after submitting 25 correct answers and no incorrect label.
Consider now the incorrect answers to golden items, i.e., W > 0, described in Figure 4. The horizontal axis displays the count of wrong labels assigned to golden items, with the remaining ones assumed to be correct; that is, the number 100 corresponds to the case where 25 and 100 golden items are correctly and incorrectly labeled, respectively. The figure illustrates the relationship between the count of incorrect labels and the credit function when considering a variety of values for the penalty coefficient T^x. This coefficient determines the behavior of the function by replacing the value of zero with T^x. In this example, x is assigned the values 0.2, 0.5, 1, 2, and 3, with T = 0.97. Note that the default setting of the original formulation is given by a coefficient of 0. In this case, the credit function drops to its lowest level right after the user submits a wrong answer to any of the golden items.
Clearly, when all labels are assigned correctly, the credit function hits the ceiling of 10,204. The strict requirements of the original formula can be smoothed by introducing counterpart values through the inverse form of the correct coefficient, namely, a positive power of T. As the number of incorrect answers approaches 0, the function converges to the default setting, while divergencies increase with the number of incorrect answers. Consider the case with 25 incorrect labels and 100 correct ones. When the penalty coefficient equals 0, 0.994, 0.985, 0.97, 0.941, 0.913, and 0.737, the outputs are given by 0, 4091.88, 3256.2, 2225.14, 1039.08, 485.23, and 2.35, respectively. The data labeling system administrator must select the appropriate penalty, ranging from a strict coefficient such as 0.737 to a more lenient one such as 0.994.
We provide additional intuition by defining a metric that measures penalty intensity. Assume that the credit function is made independent of the number of wrong answers by setting the penalty coefficient to 1 (i.e., x = 0). The penalty intensity ratio described in Table 5 is determined by the relative difference between this independent credit and the one actually received. For instance, consider the case with 25 wrong and 100 right answers. Absent of any penalty, i.e., with x = 0, the credit function equals 4765.01. When the credit received is based on x = 1, which yields 2225.14, the penalty intensity ratio is given by (4765.01 - 2225.14)/4765.01, that is, 0.533. That is, introducing x = 1 leads to a 53.3% decrease in the credit of users.
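The penalty-intensity computation for the same example, as a quick check (x = 0 removes the penalty, x = 1 is the configuration being evaluated):

```python
def credit(correct, wrong, T=0.97, n_golden=125, mu_max=10_204, x=1):
    return mu_max * T**n_golden * (1 / T)**correct * (T**x)**wrong

no_penalty = credit(100, 25, x=0)            # ~4765 units when wrong answers are ignored
penalized = credit(100, 25, x=1)             # ~2225 units with the T**1 penalty
print(round((no_penalty - penalized) / no_penalty, 3))   # ~0.533
```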
Despite the reliability illustrated through the sensitivity analysis, a simple typo could still lead to setting an inappropriate value for T. To counteract this possibility, we recommend performing a small-scale test prior to starting the Go-live labeling of a new dataset. A group of pre-selected reliable users can be enrolled to validate whether the tuned parameters of the credit function, particularly T, work well or not. After the accuracy of the credit function has been validated by the system administrator, the Go-live process of the labeling associated with a new dataset can be initiated.
5. Aggregating Results (Solution to RQ5)
RQ5 involves retrieving the outputs of the labeling system and delivering managerial reports. In this regard, the ultimate goal of the system consists in providing feedback on the outputs obtained from a labeling practice. This process encompasses the following tasks: aggregating the responses received per item, measuring the reliability of answers, assessing the volume of remaining work, and validating the accuracy of golden items as well as the reports describing the financial performance of users.
The aggregation of labeling results will be tackled by separating the analyses related to ordinary and golden items. The answers to ordinary items are unknown to the administrator while golden questions are assigned default labels. Therefore, the analysis of ordinary items should aim at deriving dominant responses under different aggregation rules and reliability metrics. The reason for double checking the responses to golden items is to evaluate the accuracy of their default labels. The answers to golden items are meticulously defined under the supervision of the administrator. Despite this fact, incorrect default labels could still be assigned.
Table 6 and Table 7 provide a set of metrics for extracting information regarding the status of annotated items within a labeling system.
The status of ordinary items can be analyzed using different metrics. It is crucial to assess the percentage of ordinary data that have received the required number of labels, defined as complete items. This implies differentiating between the contribution of items with dominant, semi-dominant, and non-dominant results. If there is a threshold of three labels per item, three identical answers (e.g., True or False) constitute a dominant result, showing a strong consensus of users over the corresponding item. In the case of two similar labels (e.g., True) and one opposite label (e.g., False), the system assigns a semi-dominant state to the item. The non-dominant status occurs when the three users report the item or provide distinct responses (i.e., True, False, and Report). The complete items with a dominant result will display True, False, or Report labels. Among the incomplete ones, the rate of unlabeled items, as well as of cases with at least a single answer submitted, will convey useful information to the admin.
We must also define a reliability criterion for the aggregated data. The criterion proposed maps the accuracy of the responses of contributors to golden data onto the complete dominant items. In particular, it specifies the contribution percentage of highly, moderately, and lowly accurate users in the answers submitted to dominant items. A highly accurate quality is attributed to a user whose number of correct responses to the golden items divided by the total number of golden items labeled is greater than or equal to 80%. When this amount is between 50% and 80%, the corresponding user is assigned a medium level of accuracy, while a percentage below 50% implies that the system faces a low-accuracy user. Note that the threshold values may vary from case to case and should be set based on the system administrator's criteria and preferences. The system must therefore allow the administrator to configure the threshold values accordingly.
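A sketch of the user-accuracy classification behind the reliability criterion, keeping the 80% and 50% thresholds configurable as recommended above:

```python
def accuracy_level(correct_golden, labeled_golden, high=0.80, low=0.50):
    """Classify a user's accuracy on the golden items as high, medium, or low."""
    if labeled_golden == 0:
        return None                       # the user has not labeled any golden item yet
    rate = correct_golden / labeled_golden
    if rate >= high:
        return "high"
    return "medium" if rate >= low else "low"

print(accuracy_level(3, 3))   # 'high'   (100%)
print(accuracy_level(2, 3))   # 'medium' (66.67%)
print(accuracy_level(1, 2))   # 'medium' (50%)
```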
We must note that although our approach is primarily designed to dissuade spammers from participating, a specific strategy is implemented based on our reliability criterion. In particular, the system administrator should set a threshold for the minimum acceptable average accuracy rate of the users who labeled a dominant item. A low value of this criterion is, indeed, the prospective outcome of spammer (or, equivalently, careless user) activity. When the value of the criterion falls below the threshold, the system changes the status of the completed item to incomplete by omitting the labels of the corresponding spammers. The item can then be completed through the contribution of other users.
The participation rate of each user type per complete dominant item constitutes a reliability indicator for the system administrator. As the portion of dominant items with highly accurate users increases, the overall performance of the labeling practice becomes more promising. Furthermore, the consistency level of the golden items demonstrates whether they have been labeled correctly. If more than 80% of the answers submitted to a golden item correspond with its pre-determined label, the consistency level is regarded as high. In the same vein, if this number is between 50% and 80%, or lower than 50%, the consistency levels are defined as medium or low, respectively. The distribution of the items answered between golden and non-golden is yet another metric that can be used to analyze the output of the system.
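A companion sketch for the consistency level of an individual golden item, again with configurable thresholds:

```python
def consistency_level(matching_labels, total_labels, high=0.80, low=0.50):
    """Share of submitted labels that agree with the golden item's default label."""
    rate = matching_labels / total_labels
    if rate >= high:
        return "high"
    return "medium" if rate >= low else "low"

print(consistency_level(2, 2))   # 'high'   -> 100% of the labels match the default
print(consistency_level(2, 3))   # 'medium' -> 66.67% of the labels match the default
```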
Given the key contribution of output objects to machine learning practice, assessing the overall accuracy of labels across the entire dataset may not provide a meaningful measure of performance. In our framework of analysis, the influence of each labeled item may vary significantly. Mislabeled items can have negative ramifications and lead to excessive processing costs depending on the context and specific machine learning application. We have therefore proposed a reliability criterion to analyze the quality of labels on an item-by-item basis. This approach ensures that the output of each completed item can be identified and categorized as low, medium, or high quality. If the administrator is not satisfied with the quality level of certain items, the system allows for the corresponding items to receive further labels until the desired quality is reached. In this way, overall improvements are guaranteed for the labeled dataset by focusing on the individual measurement of the accuracy of the items’ labels.
Table 8 describes the labeling results of a hypothetical dataset including 10 images. The images are listed from the first to the tenth, where the second, fifth, and eighth ones are positive, positive, and negative golden types, respectively. In this case, the correct label of the positive (negative) golden items is True (False). A, B, and C are three users contributing to the labeling practice. The possible responses are True (T), False (F), and Report (R), with three labels required per item. If doubtful about the correct label, users can skip (S) the images. For instance, one of the images has been skipped by user A while both B and C have assigned it a True label.
Table 9 graphically and numerically represents the metrics related to the ordinary items. For instance, to compute the no-label incompleteness metric, the denominator enumerates the items that have not received the three pre-defined labels, i.e., two items. One of them has no labels assigned, since all users have skipped it, while the other is incomplete, since it has only been labeled by B and C and skipped by A. The numerator is defined by the number of items that have no label at all, i.e., the single item skipped by all users. Thus, the no-label incompleteness rate is 1/2, which means that 50% of the incomplete items have no label.
The reliability criterion further shows that the dominant items have been completed based on the accuracy of the responses of medium-level users to the golden data. That is, the three dominant complete items are labeled by a group of users whose average accuracy on the golden data ranges between 50% and 80%. User A has responded correctly to all golden data, obtaining the whole accuracy score. User B has provided a wrong answer to one golden item, leading to two right responses out of the three golden ones and an accuracy rate of 66.67%. User C has only labeled two golden items and the response to one of them is incorrect, resulting in an accuracy rate of 50%. The average rate of the users who contribute to labeling the complete dominant items equals 72.22%. This value implies that 100% of the dominant complete items have been labeled by users with a medium accuracy level.
Table 10 describes the results derived for the golden items. To determine the consistency level of the golden data, we must first calculate the fraction of users providing responses consistent with the default positive and negative labels. For instance, two users have labeled one of the positive golden items via correct True responses. Hence, the responses of users are fully consistent with its default positive label, resulting in a 100% consistency level. Similarly, the corresponding level for each of the two remaining golden items is 66.67%, consistent with their default labels. The report delivered to the system administrator should state that 33.33% and 66.67% of the golden items involve a high and a medium level of consistency, respectively. Note that there is no golden item with a consistency level below 50%. Moreover, out of the 25 labels registered, 17 of them correspond to ordinary items and the 8 remaining answers are allocated to the golden items. In other words, 32% of the labeling is devoted to golden items whereas ordinary data account for 68%.
The functionality of the metrics proposed in our system can be compared in detail with those of recent data labeling systems and algorithms. The labeling system designed by [31] introduces a reliability metric associated with the performance of users, while ours provides a bridge between the latter and the quality of the data labeled. These authors also set a minimum reliability threshold below which the responses of the corresponding users are omitted from the final aggregated results. Ref. [32] designed an accuracy metric to evaluate the output results obtained. Their metric was defined in terms of the number of tasks (items) whose estimated labels (aggregated result) were consistent with their True labels (golden ones) divided by the total number of tasks. This definition implies that the accuracy metric in their study does not lead to an entity-based output. The corresponding accuracy metric is, indeed, applied to justify the entire dataset. Conversely, as discussed above, our paper proposes an item-by-item evaluation of accuracy, highlighting its entity-based nature. The data labeling system presented by [16] introduces a metric called precision that specifically measures the accuracy of individual annotations against a gold standard. They also define an F1 score, which is a summary statistic that provides an overall measure of performance but does not offer detailed insights into specific aspects of the labeling process. In contrast, our metrics provide a more granular analysis of performance in light of the labeling process.
The metrics proposed provide a comprehensive overview of the labeling process, ensuring informed decision making. By aggregating the responses received per item, the administrator can evaluate the consensus existing among users and identify items needing further attention. Being able to measure the reliability of answers allows the administrator to evaluate the accuracy and consistency of user contributions, identify high-performing users and address potential problems with less reliable contributors. Assessing the volume of remaining work enables efficient resource allocation and prioritization of tasks. Additionally, validating the accuracy of golden items and analyzing reports on financial performance ensures that the labeling system maintains high standards and aligns with organizational goals. This comprehensive feedback loop helps administrators optimize the labeling process, enhance data quality, and ultimately improve overall system performance.
While general accuracy metrics focus solely on the correctness of labels, the proposed metrics also consider the completeness and dominance of responses, the reliability of contributors, and the consistency of golden items. This holistic approach ensures that not only is the accuracy of individual labels assessed, but also the overall reliability and robustness of the labeling process. Finally, aggregating responses per item and distinguishing between ordinary and golden items allow for the identification of patterns and discrepancies that simple accuracy measures might miss.
6. Conclusions
This paper has elaborated on the payment mechanism and reporting framework of data labeling systems. To adopt a workable payment mechanism, we focused on customizing one of the simplest yet most reliable methods in the literature, namely, the skip-based golden-oriented function. We showed how its rigorous penalty scheme could be moderated by substituting the coefficient of zero with a power function. The behavior of the function was studied numerically, and a sensitivity analysis was performed to tune its parameters. The value of the shape parameter was selected through two metrics defined to account for the allocation of credit and the intensity of penalties. Negative golden items were introduced to hedge against the credit increase of spammers and careless users.
The distribution of golden data was used to illustrate how the enumeration of Skip labels could negatively influence the interaction of users with the system. The aggregation of results was addressed by configuring a reporting framework using multiple metrics. These metrics were proposed to signal the completion, domination, and consistency status of golden and ordinary items as well as the accuracy of the labels submitted. The quality of the labels was assessed by ranking the performance of users and calculating their contribution to completely labeling the items. Finally, the default values of golden data were double checked for consistency and the proportion of labeled golden versus ordinary items was also analyzed.
Among the potential extensions of this study, software engineering-oriented practices could be defined to develop the labeling system through data models, pseudo-code, and the relationships arising across the tables of the database. As discussed throughout the paper, a cornerstone of our study focused on enhancing the incentives of users to trust the newly launched data labeling system. In this regard, surveys can be carried out to assess the importance that the satisfaction of users has for boosting the corresponding system as well as providing high-quality labels.