Article

A Novel Approach Using Non-Experts and Transformation Models to Predict the Performance of Experts in A/B Tests

by Phillip Stranger 1,2, Peter Judmaier 3, Gernot Rottermanner 3, Carl-Herbert Rokitansky 4, Istvan-Szilard Szilagyi 5, Volker Settgast 2 and Torsten Ullrich 1,2,*

1 Institute of Computer Graphics and Knowledge Visualization, Graz University of Technology, 8010 Graz, Austria
2 Fraunhofer Austria Research GmbH, 8010 Graz, Austria
3 Fachhochschule St. Pölten Forschungs GmbH, 3100 St. Pölten, Austria
4 4D Aerospace Research and Simulation GmbH, 5020 Salzburg, Austria
5 Division of Medical Psychology, Psychosomatics and Psychotherapeutic Medicine, Department of Psychiatry, Psychosomatics and Psychotherapeutic Medicine, Medical University of Graz, 8036 Graz, Austria
* Author to whom correspondence should be addressed.
Aerospace 2024, 11(7), 574; https://doi.org/10.3390/aerospace11070574
Submission received: 1 May 2024 / Revised: 28 June 2024 / Accepted: 10 July 2024 / Published: 12 July 2024
(This article belongs to the Special Issue Human Factors during Flight Operations)

Abstract: The European Union is committed to modernising and improving air traffic management systems to promote environmentally friendly air transport. However, the safety-critical nature of ATM systems requires rigorous user testing, which is hampered by the scarcity and high cost of air traffic controllers. In this article, we address this problem with a novel approach that involves non-experts in the evaluation of expert software in an A/B test setup. Using a transformation model that incorporates auxiliary information from a newly developed psychological questionnaire, we predict the performance of air traffic controllers with high accuracy based on the performance of students. The transformation model uses multiple linear regression and auxiliary information corrections. This study demonstrates the feasibility of using non-experts to test expert software, overcoming testing challenges and supporting user-centred design principles.

1. Motivation

In today’s push towards climate neutrality, the aviation industry is at a crossroads of innovation. The European Union has set itself the goal of “modernising and improving air traffic management technologies, procedures and systems” [1] to make air travel more efficient and environmentally friendly [2]. However, this progress must also ensure the highest safety standards from the very beginning. This requirement makes extensive testing in the software development process essential. At the heart of this testing landscape is the involvement of air traffic controllers (ATCs) themselves, whose expertise ensures that the software meets operational realities and end-user needs.
However, this critical need for extensive user testing presents a major problem: the scarcity and high cost of readily available ATCs is a significant barrier to achieving the required test volume. The process of software prototyping, from design prototypes to functional prototypes to pilot systems, requires an ever-increasing number of tests. However, these numbers often exceed the availability of ATCs, in terms of both financial feasibility and organisational logistics. A lack of ATCs for user testing, whether due to organisational constraints or financial factors, limits the scope of testing and consequently reduces the depth of user feedback. This reduction in user feedback not only increases the deviation from user-centred development but also increases the risk of overlooking critical user perspectives in the software development lifecycle.
To counter this risk, an attractive solution is to broaden the testing pool by including individuals from outside the air traffic management (ATM) domain. The advantages of this approach are obvious: the pool of test subjects can be expanded and is not limited by the availability of air traffic controllers; moreover, any lack of representativeness in terms of age, gender, etc., can be compensated for more easily if the sample pool is larger.
Unfortunately, the most important disadvantage is also obvious: it is no longer the target group that is being tested.
The main goal of the study is to take advantage of the benefits of an extended user group while avoiding, or at least minimising, its disadvantages. This article describes an approach that makes it possible to partially replace experts with non-experts in A/B testing and to exploit the advantages (see Figure 1) without having to accept the disadvantages. Specifically, this article answers the following research questions:
  • Is it possible to perform a meaningful user test without the relevant user group?
  • How large is the error caused by using the wrong user group and how can it be minimised?
  • If the relevant user group is omitted (i.e., no ground truth is available), can the error still be quantified?

2. Related Work

The EuroControl white paper on human factors highlights that current ATM systems are primarily designed from a functional perspective and focus on presenting a specific set of data to users. However, as Perott et al. note, the presentation of these data often follows a technical rather than a user-centred perspective [3]. As a result, EuroControl advocates a shift towards the user-centred design of ATM systems.

2.1. User-Centred Design

The user-centred design process is a highly iterative approach aimed at rapid prototyping and evaluation to ultimately develop a system that meets user requirements [4]. Research by König et al. demonstrates the suitability of this approach for ATC interface design, as they applied the process to create a planning tool tailored to ATC [5]. Evaluation plays a central role in user-centred design processes [6,7,8,9] and represents one of the four phases of the design process [4]. Rubin and Chisnell stress the importance of focusing on users and tasks at an early stage, especially in iterative testing [7]. Similarly, the EuroControl white paper on human factors emphasises the importance of prototyping and evaluation within the iterative design process [3].

2.2. Usability Testing

The evaluation phase of the user-centred design process requires usability evaluation methods to assess the current system. Usability testing involves using real users to test a specific system [7,10,11], with the main objective, as defined by Dumas and Redish, being to improve the usability of the product [12]. Dillon suggests that conducting tests on an application with a group of users performing specific, pre-defined tasks is widely regarded as the most accurate and reliable method for assessing the usability of the application [13]. In addition, Dumas and Redish point out the broad applicability of usability testing in different domains and product types, with test procedures being tailored to the particular context [12]. A comprehensive review by Sagar and Saha highlights usability testing as a prominently used usability evaluation method and covers usability standards, evaluation methods, metrics, and application domains [14].
In practice, usability testing typically involves users performing pre-defined task scenarios, followed by questionnaires or surveys to gather users’ opinions or relevant information [15]. For example, in the Bos et al. study, air traffic controllers tested a prototype of an electronic flight strip system. Here, ATCs tested the prototype in two traffic samples, and after each run with the prototype, they completed a questionnaire to evaluate the prototype [16]. In addition, Bos et al. mention that for evaluation purposes, debriefing sessions were held and analyses of simulator logs and video recordings were conducted. Similar methods were used by Huber et al. [17], where ATCs tested prototypes and provided feedback via questionnaires to evaluate interface and interaction concepts.

2.3. A/B Testing

While usability evaluation methods such as usability testing are used to assess a specific system, quantifying the effects of design adjustments requires data-driven methods, of which A/B testing is one of the most common [18]. A/B testing is a method used to evaluate user experience by conducting controlled experiments in which users are randomly exposed to different variants of a service or product [19,20]. Although A/B testing typically involves two variants, it should be noted that any number of variants can be tested, and with a well-designed experiment, the best-performing variant can be identified. As described by Quin et al., A/B testing tests hypotheses in live software systems, with the end users being the participants in the experiment [21]. The hypotheses in this context represent variants of the software system being tested, and the metrics resulting from the A/B test can be used to identify the more user-friendly variant.
A/B testing is widely used in various domains, especially in web, search engine and e-commerce applications. In the web sector, it is mainly social media platforms and news publishers that use A/B testing methods [21]. For example, Hagar and Diakopoulos [22] conducted an interview study examining how newsrooms use A/B testing to select optimal headlines and increase traffic to articles. Other examples include the Wikipedia Foundation, which uses A/B testing to optimise a wide range of aspects [23,24,25].

2.4. Sampling and Error Correction

The results of statistical testing methods are highly dependent on the quality of the underlying data and the sampling technique used. Errors in the data or inadequate sampling procedures can lead to inaccuracies in the test results, requiring the application of statistical correction methods.
Sampling error is a major source of error in statistical testing methods. As defined by Milanzi et al., sampling error is “generally defined as the difference between the actual value of the population characteristic and an estimate obtained from a sample. This estimate is generally not equal to the true value of the characteristic because of sampling variability […] and bias” [26]. To reduce sampling error, advanced sampling techniques such as stratified sampling are often used. Stratified sampling involves dividing a population into smaller, homogeneous groups called strata. These strata are organised on the basis of characteristics or attributes shared by members of the population [27]. This division helps to prevent the inclusion of extreme samples that may skew the results [28]. In test design, each stratum of a stratified random sample is usually modelled separately to ensure accurate representation. For example, in surveys, strata can be defined based on demographic characteristics such as age, and the sample size for each stratum is determined independently of the survey according to the corresponding age group of the population. An alternative approach to stratification has been proposed by Liberty et al. They use machine learning and regression analysis to address the problem of stratification design [29].
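The stratification procedure described above can be sketched in a few lines of Python; the population, the `age_group` key, and the 25% sampling fraction are illustrative assumptions, not taken from the study:

```python
import random
from collections import defaultdict

def stratified_sample(population, key, fraction, seed=0):
    """Draw a stratified random sample: partition the population into
    strata by `key`, then sample the same fraction from every stratum."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for item in population:
        strata[key(item)].append(item)
    sample = []
    for members in strata.values():
        k = max(1, round(fraction * len(members)))
        sample.extend(rng.sample(members, k))
    return sample

# Hypothetical population of 60 people in three age-group strata
people = [{"id": i, "age_group": g}
          for i, g in enumerate(["<30", "30-50", ">50"] * 20)]
subset = stratified_sample(people, key=lambda p: p["age_group"], fraction=0.25)
# Every stratum contributes 5 of its 20 members, so len(subset) == 15
```

Because each stratum is sampled separately, no single extreme group can dominate the sample, which is exactly the skew-prevention property noted above.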
Another effective strategy for reducing sampling error is the use of auxiliary information. Bethlehem notes that auxiliary information can improve both the sampling design and the estimation procedure itself [27]. Bethlehem goes on to provide a comprehensive overview of survey methods, including sampling design, estimators, and the use of auxiliary information to reduce error and bias. Early studies by Raiffa and Schlaifer [30] and Ericson [31] explored the use of auxiliary information in stratified sample surveys. More sophisticated approaches include the use of auxiliary information for two-stage sampling [32] and for determining an optimal compromise allocation of sampling units in multivariate stratified surveys [33]. Building on these foundations, Khan et al. [34], Varshney et al. [35] and Gupta et al. [36] extended the use of auxiliary information to obtain integer optimal solutions. In addition, Deville and Särndal [37] proposed calibration estimators in survey sampling, using auxiliary information to improve the estimation of population statistics. In subsequent work, Singh et al. [38] proposed a calibration approach for improved variance estimators in survey sampling, while Kim et al. [39] proposed various ratio estimators in the calibration approach and Wu and Sitter [40] used auxiliary information in a model calibration approach.

3. A/B Test Setup

The new approach is applied to a test configuration that corresponds to the classic A/B test with experts. In order to control as many factors as possible in the new approach, an A/B test that has already been successfully performed, documented and published in a previous project will be repeated: a comparison of an ATC software (4D-NAVSIM, version 2023; VAST, version 4.14 based on Unity 2019) user interface in 2D and in 3D [41,42]. The setup consists of a prototype, the result of previous efforts [41,43,44,45], coupled with an existing air traffic simulator [46], which enables realistic air traffic control simulations.
The test involved 28 participants, including eight ATCs (one female, seven male) and twenty students (seven female, thirteen male) with experience in 3D video games. The ATCs work at an international Austrian airport, while the students were enrolled in media technology or computer science programmes at the University of Applied Sciences St. Pölten and Graz University of Technology, respectively.

3.1. Test Setup and Protocol

The test setup and protocol closely follow those of the previous “Virtual Airspace and Tower (VAST)” project [42]. The tests were conducted in dedicated environments, with ATCs being tested in Salzburg and students being tested at their respective universities. To facilitate a smooth experimental scenario, the test setup consisted of a PC with a powerful GPU, a 4K monitor for the prototype, and standard peripherals. In addition, the air traffic simulator (ATS) ran on separate hardware, and interaction with the traffic simulator was facilitated by voice control via a headset with a microphone.
After a general introduction to the test setting, participants completed a newly developed psychological questionnaire, which was later used as auxiliary information for statistical correction. In a training phase, participants were then free to explore the prototype. Subsequently, as in Rottermanner et al. [42], two test scenarios—Task 1 (2D) and Task 2 (3D)—were performed for 20 min each, with participants using voice control to manage air traffic. The objectives mirrored those of Rottermanner et al. [42], focusing on efficient and safe aircraft landing with a test scenario based on data from Frankfurt airport. As all ATC participants work at an Austrian airport, Frankfurt Airport ensures that all participants are confronted with an unknown air traffic control scenario and environment.
The tasks also remained unchanged; i.e., in Task 1, the 2D task, participants were restricted to an aerial (bird’s eye) view of air traffic, while in Task 2, the 3D task, participants were allowed to adjust the viewing angle within a specified range, excluding the aerial option. As in Rottermanner et al. [42], the NASA Task Load Index (NASA TLX) [47] to assess workload and the Situational Awareness for SHAPE questionnaire (SASHA_Q) [8,48] to assess situational awareness were completed by the participants after each task.

3.2. Flight Data

Similar to VAST, the test used real-time flight data from Frankfurt Airport to ensure that participants were exposed to a complex and realistic air traffic control scenario. The data, recorded over one day, included departing and arriving air traffic and were used at four different start times for different scenarios. One scenario was used for training, two were used for the test tasks and one was used as a backup, with all scenarios falling within the 12 pm (noon) to 2 pm time window. This approach prevented participants from anticipating flight behaviour in subsequent tasks [42].

3.3. Performance Measures

During each task, several performance measures were tracked, including the number of aircraft taken over, the time to take over, the number of landings, the deviations from simulation-based optimised routes and landing times, the altitude and distance of unlanded aircraft, the conflicts and the instructions given. These measures were combined to create task-related key performance indicators (KPIs) for each participant. As the simulated ATS traffic was taken as the optimal case, the subjects’ performance measures were related to the simulated performance of the ATS. Table 1 lists all key performance indicators.

4. Statistical Error Correction

The basic idea of the new approach is to deliberately introduce a systematic statistical error into the study and then correct it. Under normal circumstances, it is not a good idea to conduct a user test with the wrong target group. However, if the target group is difficult to reach, it may make sense—not for statistical reasons, but for economic, organisational or other reasons—to deliberately introduce this error and then correct it.
The essence of this study is to involve non-domain individuals in the process of testing expert software. To achieve this, the non-domain individuals need to be mapped into the domain of the domain experts. By using auxiliary information, the approach aims to minimise the introduced error of testing expert software with non-domain individuals.
The approach can be easily illustrated for better understanding. Figure 2 provides a visual representation of the main idea of the approach. Basically, the approach aims to construct a model that facilitates the transfer of test results from non-domain experts to domain experts by using auxiliary information. In Figure 2, domain experts are denoted as ATC_i and non-domain individuals are denoted as S_j. Both non-domain individuals and experts are assessed using a single task (Task 1) and a psychological questionnaire that serves as auxiliary information. A linear model is then developed to establish the relationship between the Task 1 results and the auxiliary information of each domain expert (ATC_i) on the one hand and the Task 1 results and the auxiliary information of all non-domain individuals on the other. This model consists of a weight vector for each expert; each vector contains the weights to optimally represent an individual expert by non-experts in terms of a linear regression model. Consequently, the model can be applied to the Task 2 performances of the non-experts to predict the Task 2 KPIs of the experts.
In an actual application scenario, the tests would now be completed (and the controller testing effort saved for Task 2), but in order to not only statistically prove but also clearly demonstrate the accuracy of the predictions, the controller test results are also recorded in Task 2 and compared with the model predictions.
In summary, Task 1 scores are used in conjunction with auxiliary information to create a linear mapping model from non-domain individuals to the expert. The Task 2 scores of the non-domain individuals, together with their auxiliary information, are then used to predict the Task 2 scores of each domain expert.

4.1. Auxiliary Information

In this new approach, auxiliary information is used to counteract the introduced systematic error in the mapping of non-experts to experts. A psychological questionnaire is used as the auxiliary information. In order to create the most suitable psychological questionnaire for providing auxiliary information in the novel mapping process, a series of workshops were conducted with psychologists to define the requirements of the auxiliary information.
A number of characteristics were considered essential to the mapping process for the auxiliary information questionnaire:
  • For procedural reasons, the questionnaire should not provide free text fields for responses but should only allow responses on a numerical scale or be directly mappable to such a scale. Furthermore, as the questionnaire was to be included in a user test, it was imperative that the test could be completed within a limited time (in this case 45 min).
  • The test had to cover a wide range of ATM or ATM-related topics without being too specific, as it was intended to be auxiliary information. If the test was too specific (e.g., a question that all ATCs answered in the same way), the information value of the question would be low; if all non-experts also answered in the same way, the information value would be non-existent. From a statistical point of view, the answers to the questions should ideally have a normal distribution for both the experts and the non-experts. The additional information is not used to select study participants who match the requirement profile of air traffic controllers as closely as possible; participants with a negative correlation to the requirement profile (laypersons who, in extreme cases, do the opposite of professionals) also provide valuable information.
  • Psychological interpretation of the test results was not required for the purposes of this study; i.e., the questionnaire did not have to be a validated psychological test. The aim is not to create personality or character profiles, and although the tests are conducted anonymously, the questionnaire should not contain any questions that could be ethically or legally problematic.
  • Aspects already covered by the KPIs, in particular the workload and situational awareness questionnaires used, should not be included in this psychological test.
Following several sessions with multiple psychologists, a consensus was reached. The final questionnaire emphasises various aspects crucial for successful performance in the ATC profession. These encompass personality traits, such as decisiveness, responsibility and teamwork skills, as well as stress management and processing, concentration, cognitive abilities, intelligence and work ethic [49,50]. The questionnaire consisted of 75 questions, each tailored to focus on specific aspects. Questions focusing on personality traits are based on the Big Five model [51], which includes five dimensions: surgency, agreeableness, conscientiousness, emotional stability and intellect. For example, questions #1 “I tend to be spontaneous.”, #25 “I have a passion for collecting.” and #32 “I love rituals.” are taken from the psychological questionnaires in the categories of personality traits (#1), stress management and processing (#25) and work ethic (#32). In addition to cognitive and perceptual skills, there are questions designed to assess concentration. The entire questionnaire can be found in Appendix A. It also comprises two tests (see Appendix A.2 and Appendix A.3). As each test is weighted in the same way as each of the 75 questions, the two tests play a minor role. Since the influence of the tests (as well as the individual questions) is an open research question, we opted for more questions and fewer tests due to the time constraints of the complete A/B test setup.

4.2. Transformation Model

As illustrated in Figure 2, each participant is represented by task scores combined with auxiliary information. Specifically, the data for each participant consisted of 26 KPIs, 6 NASA TLX scores, and 8 SASHA_Q scores. The auxiliary information included 77 scores, of which 75 scores were from the psychological questionnaire and 2 scores were from the additional psychological tests focused on assessing concentration, cognitive and perceptual abilities. Combining the task results and the auxiliary information resulted in 117 values per participant. An overview of how the samples are split into the respective components is given in Table 2.
Due to the different ranges of the KPIs and questionnaire responses, normalisation was required. All 117 samples were normalised to the interval between zero and one using the equation
x_norm = (x − min) / (max − min).  (1)
For continuous variables, such as the KPIs, min and max refer to the minimum and maximum across all tasks and subjects for the specific variable. For discrete variables, such as the questions of the psychological questionnaire, the NASA TLX or the SASHA_Q questionnaire, min and max refer to the minimum and maximum allowed values for the questionnaire. In addition, continuous variables were padded by 10% of their respective min-max range.
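As a sketch, the normalisation of Equation (1), the 10% padding for continuous variables, and the inverse mapping can be written as follows. The KPI numbers are placeholders, and the padding is applied to each side of the observed range here, which is one possible reading of the text:

```python
import numpy as np

def normalise(x, lo, hi):
    """Min-max normalisation to [0, 1], cf. Equation (1)."""
    return (x - lo) / (hi - lo)

def denormalise(x_norm, lo, hi):
    """Inverse mapping back to the original scale, cf. Equation (3)."""
    return x_norm * (hi - lo) + lo

# Continuous KPI values (placeholder numbers): pad the observed
# min-max range by 10% before normalising
kpi = np.array([12.0, 30.0, 18.5, 25.0])
span = kpi.max() - kpi.min()
lo, hi = kpi.min() - 0.1 * span, kpi.max() + 0.1 * span
kpi_norm = normalise(kpi, lo, hi)

# Discrete questionnaire scores use the allowed range instead,
# e.g. a 1-5 Likert item:
answers = np.array([1, 3, 5])
answers_norm = normalise(answers, 1, 5)
```

The padding keeps the normalised continuous values strictly inside (0, 1), so later predictions slightly outside the observed range still denormalise to plausible values.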
The model itself is based on multiple linear regression (MLR) that is carried out with p = 19 independent variables: one independent variable per student, with one student removed due to incomplete test results. If Y_i is the score vector of ATC_i (with 117 dimensions as listed in Table 2) and X_j is the score vector of the non-expert student j, then the MLR model consists of the weights β_{i,j} and the errors ε_i according to the equation

Y_i = Σ_{j=1}^{19} β_{i,j} · X_j + ε_i.  (2)
In general, Equation (2) cannot be solved exactly because it is overdetermined. This is exactly the purpose of the auxiliary information. Instead of an exact solution, which is not desirable for numerical reasons and not expected for modelling reasons, a least squares approximation is used. Normal equations and Cholesky decomposition give least squares estimates for the student weights β̂_{i,j} and the offsets ε̂_i (i = 1, …, 8 and j = 1, …, 19).
The predictions are now calculated by multiplying the results of Task 2 of the non-expert students by the previously calculated weights and adding them together to predict the results of each individual expert.
As the predictions are calculated on normalised data and are therefore in normalised form, denormalisation must be applied. Denormalisation is the reverse process of normalisation and is achieved with the following equation:
x_denorm = x_norm · (max − min) + min,  (3)
where min and max are the same minima and maxima used in the normalisation process.
The quality of fit of the standard MLR models is assessed by the coefficient of determination R². This coefficient, introduced by Wright [52], generally indicates how well the regression model explains the data. More specifically, R² can be interpreted as the proportion of variance in the data that is explained by the regression model. Thus, an R² value of 0.75 would indicate that 75% of the variance in the data can be explained by the regression model.
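The coefficient of determination can be computed directly from the residuals; a short self-check in Python:

```python
import numpy as np

def r_squared(y, y_hat):
    """R^2: proportion of the variance in y explained by the predictions y_hat."""
    ss_res = np.sum((y - y_hat) ** 2)          # residual sum of squares
    ss_tot = np.sum((y - np.mean(y)) ** 2)     # total sum of squares
    return 1.0 - ss_res / ss_tot

y = np.array([1.0, 2.0, 3.0, 4.0])
print(r_squared(y, y))                      # perfect fit: 1.0
print(r_squared(y, np.full(4, y.mean())))   # mean-only model: 0.0
```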
The entire transformation model can be evaluated via this quality-of-fit measure; for predictions based on such a model, confidence intervals are provided by Olive [53]: the 100(1 − δ)% confidence interval for a prediction ŷ_i is calculated via

ŷ_i ± t_{n−p−1, 1−δ/2} · √(σ² (1 + x_iᵀ (XᵀX)⁻¹ x_i))  (4)

using the t-distribution, the estimated variance σ² of the errors ε_i, and the input vector x_i.
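Equation (4) can be sketched as follows; the synthetic data are placeholders, and the n − p − 1 degrees of freedom match the t-distribution quantile used in the text:

```python
import numpy as np
from scipy import stats

def prediction_interval(x_new, X, y, delta=0.05):
    """100(1 - delta)% interval for a prediction, cf. Equation (4)."""
    n, p = X.shape
    XtX = X.T @ X
    beta = np.linalg.solve(XtX, X.T @ y)          # least squares estimate
    resid = y - X @ beta
    sigma2 = resid @ resid / (n - p - 1)          # estimated error variance
    t = stats.t.ppf(1 - delta / 2, n - p - 1)     # t-distribution quantile
    half = t * np.sqrt(sigma2 * (1 + x_new @ np.linalg.solve(XtX, x_new)))
    y_hat = x_new @ beta
    return y_hat - half, y_hat + half

# Synthetic illustration: a known linear relationship plus small noise
rng = np.random.default_rng(2)
X = rng.random((50, 3))
y = X @ np.array([1.0, 2.0, 3.0]) + 0.01 * rng.standard_normal(50)
low, high = prediction_interval(X[0], X, y)
```

The interval width grows with the estimated error variance and with how far x_new lies from the bulk of the training inputs, which is what makes it usable for the A/B decision rule discussed in Section 5.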

5. Results

To illustrate and demonstrate the new approach, we repeated an A/B test of an earlier user study involving air traffic controllers.

5.1. “Virtual Airspace and Tower”

In the specific example of repeating the user interface A/B test from the previous “Virtual Airspace and Tower (VAST)” project [42], the application of the new method is as follows: Task 1 and the psychological test (auxiliary information) were completed by both the expert ATCs and the non-expert students. After the values were normalised, the model parameters were determined using the normal equation and the Cholesky decomposition. Table 3 shows the model parameters. This table also includes statistics such as the minimum (min), maximum (max), mean, standard deviation (std.-dev.), and variance of the weights (ε̂, β̂_1, β̂_2, …, β̂_19) for each model.
Inspection of Table 3 reveals a visually uniform distribution of weights in the range [−0.5, 0.5] with no gross outliers, although no range has been enforced by any constraints. The minimum weight, β̂_10 = −0.43356191, corresponds to ATC 5, while the maximum weight, β̂_14 = 0.47690763, belongs to ATC 7. Since the selection of non-experts is not limited to people who are as similar as possible to the experts, negative weights also occur. This may lead to invalid values in the prediction and extrapolation of future test results, but it does not restrict the selection of non-experts in any way: an advantage that may justify a possible extrapolation error, which does not necessarily occur. If this is not desired, non-experts with negative coefficients, such as β̂_10, should be removed.
In statistics, the coefficient of determination R² is used to determine the quality of fit of a model. Specifically, R² is the proportion of variation in the dependent variable that can be predicted by the independent variables. In this way, it provides a measure of how well the observed results are replicated by the model, based on the proportion of total variation in the outcomes explained by the model. Table 4 shows how well each ATC’s test results can be described by the model of non-experts.
Pearson’s correlation coefficients are calculated between the dependent variable y and the independent variables (x_1, x_2, …, x_19), denoted as r_{y,x_1}, r_{y,x_2}, …, r_{y,x_19}. In addition, the correlations between the independent variables themselves are calculated (r_{x_1,x_2}, r_{x_1,x_3}, …, r_{x_18,x_19}). The correlation matrix illustrates these coefficients (see Figure 3): the first column of the correlation matrix shows the correlations between the dependent variable y and each independent variable, while the remaining columns show the Pearson correlation coefficients between all independent variables.
In Figure 3, the highest correlation between the dependent variable y and the independent variables can be seen for x_17 with r_{y,x_17} = 0.57. Furthermore, x_3 and x_14 have correlations with the dependent variable greater than 0.5. Notably, x_10 is the only independent variable that has a negative correlation with y, with r_{y,x_10} = −0.15. Among the independent variables, the highest correlation coefficient is observed between x_12 and x_15 with r_{x_12,x_15} = 0.66. Other pairs of independent variables with correlation coefficients greater than 0.6 include r_{x_2,x_12} = 0.6 and r_{x_3,x_9} = 0.61; the only negative correlation among the independent variables is observed between x_10 and x_13 with a value of r_{x_10,x_13} = −0.046. The five smallest correlations in absolute terms are (in decreasing order) r_{x_3,x_5} = 0.071, r_{x_9,x_10} = 0.069, r_{y,x_5} = 0.068, r_{x_10,x_13} = −0.046, and r_{x_3,x_10} = 0.034.
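A correlation matrix with this layout can be computed in a few lines; the random scores stand in for the real expert and student data:

```python
import numpy as np

# Placeholder scores: one expert (y) and 19 students (columns of X)
rng = np.random.default_rng(0)
y = rng.random(117)
X = rng.random((117, 19))

# Stack y as the first variable; rowvar=False treats columns as variables
corr = np.corrcoef(np.column_stack([y, X]), rowvar=False)

r_y_x = corr[0, 1:]    # first row/column: correlations of y with each x_j
r_x_x = corr[1:, 1:]   # pairwise correlations among the independent variables
```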
The model listed in Table 3 is used to transform the results of Task 2 from the non-expert students to the expert ATCs.

5.2. Transformation Results

The results of the transformation are summarised and listed in Table 5. To illustrate the quality of the transformation, the ATCs also performed Task 2 (observation), and these averaged results are compared with the averaged predictions using the transformation model (prediction) including and excluding the correction using auxiliary information. To facilitate comparison between the KPIs, the relative errors of the normalised values (according to Equation (1)) are also given. As the relative errors depend on the size of the range interval, i.e., the minimum and maximum values of all test results by ATCs and non-ATCs, the listed percentages are sensitive to outliers. Nevertheless, it makes sense to normalise the data in order to be able to compare the error values of the individual categories, which can differ by orders of magnitude.
The transformation model deliberately allows for negative coefficients (see Table 3); if all non-experts with negative weights had been removed (as discussed above), the number of subjects would have been significantly reduced: only 5 of the 19 non-experts have consistently positive weights. As already mentioned, negative weights increase the likelihood of semantically unreasonable values in the extrapolation/prediction (e.g., a negative prediction when in reality only non-negative values are meaningful and possible). Nevertheless, the transformation model is convincing and shows improvements over models without auxiliary information. The improvement column lists the average improvement (reduction in errors) in percentage points of the relative errors achieved through the use of auxiliary information. On average, the use of auxiliary information reduces the prediction error by 12%.
In the intended application scenario of the transformation model—replacing unavailable or difficult-to-reach air traffic controllers in the test with an alternative target group for cost and/or organisational reasons—the real observations are not known. The proposed interpretation of an A/B test prediction can be based on the confidence intervals (see Equation (4)): in an A/B test setting, the relevant question is whether version A or version B is better. If the test results (KPIs) of the ATCs in Task 1, t1, and their prediction for Task 2, t2, differ, the confidence interval t2 ± conf(δ) can be determined as a function of the confidence level δ such that the separation t1 ∉ t2 ± conf(δ) holds for the maximum possible δ. This view allows the test question to be answered in terms of how confident one can be that one version (A or B) is better than the other and that the test result is not random. Such a representation is shown in the appendix in Table A1 and Table A2.
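The decision rule just described can be sketched as follows. We assume conf(δ) is a symmetric interval t2 ± z(δ)·s around the predicted mean, with s the spread of the per-ATC predictions; the exact form of Equation (4) in the article may differ, and all numeric inputs are illustrative:

```python
from statistics import NormalDist, mean, stdev

# Find the largest confidence level delta for which the Task 1 observation t1
# falls outside the prediction interval t2 +/- z(delta) * s.
def max_confidence(t1, t2_predictions, levels=(0.80, 0.90, 0.95, 0.99)):
    t2 = mean(t2_predictions)
    s = stdev(t2_predictions)
    best = None
    for delta in levels:  # wider intervals for larger delta
        z = NormalDist().inv_cdf(0.5 + delta / 2.0)
        if abs(t1 - t2) > z * s:
            best = delta  # separation still holds at this level
    return t2, best

t2, delta = max_confidence(t1=1.09, t2_predictions=[0.75, 0.77, 0.71, 0.84])
print(t2, delta)  # delta is None if t1 lies inside even the narrowest interval
```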
The results presented in “Design and Evaluation of a Tool to Support Air Traffic Control with 2D and 3D Visualizations” [42] could not be reproduced completely; in this repeated study, the A/B test showed significant differences between Task 1 (2D) and Task 2 (3D) according to Mann–Whitney U tests only for
  • Distance not landed/plane % [ U = 59 , p-value = 0.003 ],
  • Distance not landed total (km) [ U = 7 , p-value = 0.007 ],
  • Distance not landed/plane (km) [ U = 8 , p-value = 0.010 ].
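A minimal Mann–Whitney U test (normal approximation with average ranks for ties, no tie correction) matching the per-KPI comparisons above can be sketched as follows; the article does not state which implementation was used, and the sample values below are placeholders:

```python
from statistics import NormalDist

# Two-sided Mann-Whitney U test via the normal approximation.
def mann_whitney_u(a, b):
    combined = sorted((v, g) for g, xs in ((0, a), (1, b)) for v in xs)
    vals = [v for v, _ in combined]
    ranks = {}
    i = 0
    while i < len(vals):                  # assign average ranks to tie blocks
        j = i
        while j < len(vals) and vals[j] == vals[i]:
            j += 1
        r = (i + 1 + j) / 2.0             # average of ranks i+1 .. j
        for k in range(i, j):
            ranks[k] = r
        i = j
    r_a = sum(ranks[k] for k, (_, g) in enumerate(combined) if g == 0)
    n1, n2 = len(a), len(b)
    u1 = r_a - n1 * (n1 + 1) / 2.0
    u = min(u1, n1 * n2 - u1)             # report the smaller U, so z <= 0
    mu = n1 * n2 / 2.0
    sigma = (n1 * n2 * (n1 + n2 + 1) / 12.0) ** 0.5
    z = (u - mu) / sigma
    p = 2 * NormalDist().cdf(z)           # two-sided p-value
    return u, p

u, p = mann_whitney_u([1.2, 0.8, 1.5, 1.1], [0.4, 0.6, 0.7, 0.9])
print(u, round(p, 3))
```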
Inspecting the KPIs in Tables A1 and A2 reveals four KPIs with high confidence percentages across all models, namely “Landings 2”, “Distance not landed total (km)”, “Distance not landed/plane (km)”, and “Distance not landed/plane (%)”.
Unfortunately, this study suffers from the very problem it seeks to solve: the statistical tests could not be carried out to the necessary extent with ATC subjects. Despite the severe limitation of having only eight ATC participants, the transformation model was able to show that the essential statements of the A/B test could be generated with the non-expert students.

6. Conclusions

The aim of this new approach was to test the feasibility of involving non-experts in the evaluation process of expert software, focusing specifically on whether a transformation model could be constructed to predict test results for ATCs using students’ test results. Using auxiliary information in the form of a newly developed psychological questionnaire, we constructed a novel transformation model from the students’ Task 1 results to the Task 1 results of each ATC. We then predicted Task 2 results for each ATC based on the students’ Task 2 results.
Using multiple linear regression to create the transformation model, we achieved accurate predictions of the ATCs’ Task 2 results for the majority of the defined KPIs based on the students’ Task 2 performance. In other words, the first research question, whether it is possible to perform a meaningful user test without the relevant user group, can be answered in the affirmative. The errors of the averaged predictions were generally small, with the majority of KPIs showing errors of less than 10% and all KPIs showing errors of less than 30%. The examination of the quality of fit revealed coefficients of determination between 45% and 66%; on average, 54.8% of the variance in the dependent variable is accounted for by the independent variables, underlining the predictive power of the approach. This example answers the second research question about the expected errors.
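The regression step can be sketched schematically as follows: for each ATC, weights over the non-experts’ Task 1 results are fitted by least squares and then applied to their Task 2 results. The data are random placeholders, the subject and KPI counts are chosen for illustration, and the auxiliary-questionnaire rows of the article’s full model are omitted here:

```python
import numpy as np

# Fit per-student weights on Task 1, then predict one ATC's Task 2 results.
rng = np.random.default_rng(1)
n_students, n_kpis = 19, 27

S1 = rng.normal(size=(n_kpis, n_students))  # students' Task 1 KPIs (one column each)
true_w = rng.normal(size=n_students)        # synthetic ground-truth weights
atc1 = S1 @ true_w + 0.1 * rng.normal(size=n_kpis)  # one ATC's Task 1 KPIs

w, *_ = np.linalg.lstsq(S1, atc1, rcond=None)  # weights may be negative

fitted = S1 @ w
ss_res = np.sum((atc1 - fitted) ** 2)
ss_tot = np.sum((atc1 - atc1.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot                  # coefficient of determination
print(f"R^2 = {r2:.3f}")

S2 = rng.normal(size=(n_kpis, n_students))  # students' Task 2 KPIs
atc2_pred = S2 @ w                          # predicted ATC Task 2 results
```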
The selection of the questionnaire remains an open question; we suspect that the questionnaire depends only on the field of application (air traffic management), as an instance of constructive error correction for error minimisation, but further research is needed to confirm this hypothesis. The number of auxiliary questions is also open. On the one hand, a comparison of models with and without auxiliary information indicates that the prediction improves when some auxiliary information is used; in our example, the prediction improved by 12% on average (see Table 5). On the other hand, the auxiliary information and the KPIs to be predicted are part of the same transformation model, so as the number of auxiliary questions increases, the impact of the KPIs on the transformation will diminish, potentially reducing the prediction accuracy. The optimal number of additional questions and tests is unknown and remains an open research question. Furthermore, all questions and tests enter the transformation model with uniform weight, despite the possibility that some questions are more important than others; which questions matter most is also unclear.
A notable restriction of this study was the limited size of the test pool. Ideally, the proposed approach would be validated with a larger pool involving more ATCs and students. However, this is limited by the availability of ATCs for testing purposes—the very limitation that this approach aims to alleviate. Even if the error cannot be avoided, it can at least be bounded by confidence intervals; i.e., one does not have to trust the transformation model blindly. This answers the third research question about error quantification.
In summary, our results highlight the potential of the presented approach to improve the evaluation process of expert software by involving non-experts in the testing phase. By developing and validating a novel transformation approach that incorporates auxiliary information from a newly developed psychological questionnaire, we have demonstrated the ability to predict the performance of ATCs based on students’ test scores. The approach allows testing with non-experts, while ATCs are only needed at the beginning to build the transformation model. However, as shown in Figure 1, we recommend involving experts in testing at key milestones and, at the end of the software development process, validating the end result.
The new approach not only avoids the challenge of obtaining a sufficient number of ATCs for testing but also increases the frequency of testing while ensuring that a wider range of perspectives is incorporated into the evaluation process. With this approach, more tests can be performed within the same budget, resulting in better-tested and more user-centred software, in line with the push for user-centred design by EUROCONTROL [3].

Author Contributions

Conceptualization, I.-S.S. and T.U.; Data curation, P.S., P.J., G.R. and V.S.; Formal analysis, T.U.; Funding acquisition, T.U.; Investigation, P.S., P.J., G.R., C.-H.R., V.S. and T.U.; Methodology, T.U.; Project administration, V.S.; Resources, P.S., P.J., G.R., C.-H.R., I.-S.S. and V.S.; Software, P.S., P.J., G.R., C.-H.R. and V.S.; Supervision, T.U.; Validation, P.S., P.J., G.R., C.-H.R., I.-S.S., V.S. and T.U.; Visualization, P.S., G.R., C.-H.R., V.S. and T.U.; Writing—original draft, P.S., P.J., G.R., C.-H.R., I.-S.S., V.S. and T.U.; Writing—review and editing, P.S., P.J., G.R., C.-H.R., I.-S.S., V.S. and T.U. All authors have read and agreed to the published version of the manuscript.

Funding

This article is supported by TU Graz Open Access Publishing Fund. The presented research results are based on the previous research projects “Virtual Airspace and Tower” (VAST), “Sondierung eines Prototyping & Evaluation Frameworks für (teil-) automatisierte Air Traffic Control Software” (EMMSA), and “Cross Level User Evaluation” (CLUE).

Data Availability Statement

The original contributions presented in the study are included in the article; further inquiries can be directed to the corresponding author.

Acknowledgments

Open Access Funding by the Graz University of Technology.

Conflicts of Interest

Authors Phillip Stranger, Volker Settgast and Torsten Ullrich were employed by the company Fraunhofer Austria Research GmbH, Peter Judmaier and Gernot Rottermanner were employed by the company Fachhochschule St. Pölten Forschungs GmbH, Carl-Herbert Rokitansky was employed by the company 4D Aerospace Research and Simulation GmbH. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Appendix A. Auxiliary Information

Appendix A.1. Questionnaire

The following questions were presented to the test participants, with agreement and disagreement expressed on the following scale:
  • not true at all
  • do not agree
  • rather disagree
  • disagree a little
  • neither
  • somewhat agree
  • rather agree
  • quite true
  • very true
  • completely agree
The original questionnaire was written in the German language; the following questions are a translation that is as close to the original as possible:
  • I tend to be spontaneous.
  • I enjoy getting to know other people.
  • I enjoy giving presentations to large groups.
  • I prefer to create a cosy seclusion at home rather than going out and socialising.
  • I am an optimist.
  • In my work, I try to plan ahead as much as possible.
  • I face challenges with optimism.
  • I prefer to solve problems at work independently rather than as part of a team.
  • My favourite job is one where I can take on a high level of responsibility.
  • I really enjoy monotonous professional activities.
  • I usually make my decisions impulsively and on instinct.
  • I adapt my work activities immediately according to the situation at hand.
  • I am easily persuaded by others.
  • I like to be the centre of attention.
  • I always work purposefully to achieve my work results.
  • To cope with more difficult tasks, I seek the approval of a colleague to be on the safe side.
  • Mastering an unfamiliar professional task causes me discomfort and anxiety.
  • If necessary, I can assign clear tasks in a work context.
  • I can concentrate on monotonous tasks over a longer period of time.
  • I am depressed after challenging tasks at work.
  • I am able to communicate easily in stressful situations.
  • A stressful job is unimaginable for me.
  • People who have achieved more professionally than I have are enviable.
  • I am very resilient in my job.
  • I have a passion for collecting.
  • It makes me uncomfortable if I don’t have a situation under control.
  • I’m good with numbers.
  • I can relax after strenuous activities with exercise.
  • I find some traffic rules nonsensical.
  • I don’t follow rules that don’t make sense to me in certain life situations.
  • Standardised work processes are important to me.
  • I love rituals.
  • I see stressful situations as a kind of obstacle for me.
  • I see challenging situations as an opportunity.
  • My abilities unfold in situations that trigger stress.
  • Complex work situations should be dealt with as part of a team.
  • I am able to recognise patterns and structures in certain situations or activities where others do not see them.
  • I relax when I do sport.
  • Music is a form of relaxation for me.
  • I have to work to earn a living, but I wouldn’t do it if I didn’t have to.
  • I enjoy learning something new.
  • I take regular breaks from strenuous activities.
  • I am a creative person.
  • I play at least one musical instrument well.
  • I put other people’s needs before my own.
  • I avoid conflicts.
  • I often forget what I wanted to do a few minutes ago.
  • I get angry quickly if something doesn’t fulfil my wishes.
  • I’m not allowed to show emotions at work.
  • Sometimes I tend to let my feelings run wild.
  • I have suffered from illnesses for no apparent reason.
  • I tend to carry out tasks quickly, but with mistakes.
  • I always stand behind the decisions I make.
  • I have high expectations of myself.
  • It is very important to me that I am always committed.
  • I can change work steps quickly if necessary.
  • I often experience the feeling of losing control in my everyday life.
  • It wouldn’t be a problem for me to work a lot of overtime.
  • In difficult situations, I take a solution-orientated approach.
  • I don’t want others to realise when I can’t do something.
  • I like working alone.
  • I can easily prioritise my work.
  • I can reduce stress by using relaxation techniques.
  • I am able to concentrate on work processes despite a heavy workload.
  • I take my anger out on bystanders.
  • As soon as I get too stressed at work, I take a coffee or smoke break to relax again.
  • I find it very easy to listen.
  • If necessary, I can easily manage a clear division of tasks.
  • I find it very difficult to make a short-term decision under great pressure.
  • Treating colleagues respectfully and appropriately in the workplace is not particularly relevant to me.
  • A job where you have to speak English is out of the question for me.
  • I am able to empathise with the feelings and sensitivities of another person.
  • After a stressful day, I prefer to relax with my family or friends.
  • I can’t switch off after a stressful day.
  • I am very good at dealing with criticism.

Appendix A.2. Psychological Test #1

Indicate the frequency of occurrence of the target motif by marking (crossing out) each occurrence. You have 20 s to complete this task.
Aerospace 11 00574 i001

Appendix A.3. Psychological Test #2

Please identify and mark (using a highlighter!) all “ä” letters in a maximum of 20 s. Make sure you do not make any mistakes and process as many correct characters as possible.
Aerospace 11 00574 i002

Appendix B. Detailed Transformation Results

Table A1. The transformation model is able to calculate a prediction of the result for each KPI and for each ATC. To interpret the result—to decide which version is better in an A/B test—the confidence level is determined that the test result prediction of one version being better than the other is not a coincidence.
KPIT1 Obs.ATC 1ATC 2ATC 3ATC 4
Mean T2 Pred. ± Conf. % T2 Pred. ± Conf. % T2 Pred. ± Conf. % T2 Pred. ± Conf. %
Taken over #10.1259.5570.49120.09.5610.5125.09.0891.03345.09.6480.3310.0
Taken over %0.920.8690.04520.00.8690.04625.00.8260.09445.00.8770.0310.0
Time until takeover total209.625432.808189.57920.0411.18196.76725.0408.771168.42420.0738.025461.48435.0
Time until takeover/plane20.7550.3828.6525.043.4918.7820.048.03825.45325.088.28364.10840.0
Landings 14.53.6990.75535.03.7770.72140.03.6320.77640.04.5820.141<1.0
Landings 20.1250.6420.48765.00.4370.28850.00.430.27545.01.1711.03485.0
Landings 31.51.1670.32520.00.8290.63545.01.4650.071<1.01.9830.44120.0
Calculated Landings4.9384.240.62425.04.1480.73435.04.1350.7935.05.5510.50215.0
Optimum Landings0.90.740.15135.00.7550.14440.00.7260.15540.00.9160.028<1.0
Calculated Optimum Landings0.8980.7190.16235.00.7010.17745.00.7030.1945.00.9450.035.0
Time deviation to landing total−223.25110.315321.60970.045.174265.27770.0109.627318.14675.0−51.637160.29530.0
Time deviation to landing/plane−48.87554.504100.05555.06.06149.34335.051.51599.20960.0−32.40411.1865.0
Distance deviation to landing total−4.25918.13520.77340.09.17112.5730.015.23318.45540.03.8556.73110.0
Distance deviation to landing/plane−0.8416.8947.25940.01.2631.4310.05.4835.57535.0−1.3891.174<1.0
Height not landed total42,327.056,172.60213,038.3365.050,691.3147719.96550.051,405.6618314.94250.045,807.7722352.33810.0
Height not landed/plane6512.2087489.741924.83445.07061.755490.42230.06977.802436.54725.06781.833262.6110.0
Distance not landed total101.135194.7790.12175.0180.96874.33675.0186.31380.06575.0120.00113.13310.0
Distance not landed/plane15.50125.3679.35980.024.3378.7285.024.2888.31580.017.8041.83515.0
Distance not landed/plane %1.090.7470.31670.00.7490.32580.00.7680.31375.01.060.0265.0
Conflicts0.1251.7841.49435.0−0.0320.17<1.01.7261.53540.02.6982.34440.0
Instructions/plane5.0574.7580.25910.04.5150.43220.04.6310.34715.04.6570.35210.0
Instructions total51.12545.3745.56125.043.5167.56640.042.7458.14940.043.1667.54525.0
NASA TLX Average [0, 100]45.72955.1327.6820.055.3247.97125.054.8798.58625.058.5910.41920.0
NASA TLX Average %0.5430.5280.019<1.00.4810.04815.00.4890.05215.00.5590.026<1.0
SASHA Q Average [0, 5]3.5783.440.12220.03.3720.1835.03.4040.16530.03.8070.20825.0
SASHA Q Average %0.7160.6880.02420.00.6740.03635.00.6810.03330.00.7610.04225.0
Table A2. Continuation of Table A1.
KPIT1 Obs.ATC 5ATC 6ATC 7ATC 8
Mean T2 Pred. ± Conf. % T2 Pred. ± Conf. % T2 Pred. ± Conf. % T2 Pred. ± Conf. %
Taken over #10.1259.740.31210.09.7870.27610.09.2990.67120.09.840.27810.0
Taken over %0.920.8850.02810.00.890.02510.00.8450.06120.00.8950.02510.0
Time until takeover total209.625602.965370.13530.0384.841160.48515.0676.498464.45335.0515.786272.44225.0
Time until takeover/plane20.7570.21444.45330.041.67419.27415.076.20247.33230.059.72132.7225.0
Landings 14.53.4750.96835.04.1250.35615.03.6060.87530.04.0950.35915.0
Landings 20.1250.7910.62565.00.9890.76580.00.8110.66565.01.0970.8785.0
Landings 31.50.9570.52425.01.550.091<1.00.550.92140.01.3560.0925.0
Calculated Landings4.9383.9550.96830.04.8830.139<1.04.070.85225.04.8840.14<1.0
Optimum Landings0.90.6950.19435.00.8250.07115.00.7210.17530.00.8190.07215.0
Calculated Optimum Landings0.8980.6860.20835.00.8350.05110.00.6920.18830.00.8310.05110.0
Time deviation to landing total−223.25−53.14151.51330.06.59208.49845.0−2.038219.91240.0−50.365159.02735.0
Time deviation to landing/plane−48.875−23.23521.19210.00.53147.59225.0−12.77233.96815.0−22.50718.87410.0
Distance deviation to landing total−4.2593.7646.36210.08.2311.35520.07.75110.19815.01.8135.66610.0
Distance deviation to landing/plane−0.8411.5422.22310.02.1462.96115.01.7862.36710.0−0.7480.988<1.0
Height not landed total42,327.048,251.0715644.1925.051,279.2318235.99140.054,466.84111,311.12645.049,827.2017168.13535.0
Height not landed/plane6512.2086713.634123.8475.07290.622674.50230.07565.874956.70235.07013.468445.98320.0
Distance not landed total101.135165.18659.3145.0178.58174.2160.0190.64989.31660.0152.1846.29240.0
Distance not landed/plane15.50122.0116.22150.024.0087.66765.025.4849.22765.020.8424.90445.0
Distance not landed/plane %1.090.90.17635.00.8350.23250.00.7080.34960.00.9130.15635.0
Conflicts0.1251.6221.34325.00.2680.234<1.00.6470.2815.01.8051.44730.0
Instructions/plane5.0573.8591.02230.04.60.44315.04.0890.89925.04.5710.44615.0
Instructions total51.12539.48110.16935.045.4675.01420.039.38910.82835.044.7845.04820.0
NASA TLX Average [0, 100]45.72948.6282.4365.048.4332.1555.060.5813.19525.055.6118.77120.0
NASA TLX Average %0.5430.5650.025<1.00.6060.04410.00.4040.13525.00.5360.022<1.0
SASHA Q Average [0, 5]3.5783.3950.15620.03.6070.034<1.03.2570.29935.03.7750.17525.0
SASHA Q Average %0.7160.6790.03120.00.7210.007<1.00.6510.0635.00.7550.03525.0
Table A3. The tests NASA TLX and SASHA Q each consist of individual subtests. Their predictions and partial results are listed separately in this table.
ItemT1 Obs.ATC 1ATC 2ATC 3ATC 4
Mean T2 Pred. ± Conf. % T2 Pred. ± Conf. % T2 Pred. ± Conf. % T2 Pred. ± Conf. %
NASA TLX Average [0, 100]45.72955.1327.6820.055.3247.97125.054.8798.58625.058.5910.41920.0
NASA TLX Average %0.5430.5280.019<1.00.4810.04815.00.4890.05215.00.5590.026<1.0
    mental61.56254.6516.3915.080.4918.89850.061.331.8810.072.78.66815.0
    physical27.554.23725.67250.041.61412.0530.054.5625.58155.052.67723.35635.0
    temporal43.43759.15513.46330.069.93724.42960.068.51623.57555.079.21832.09650.0
    performance48.7562.72313.3830.030.14917.16645.048.4661.9310.037.7938.89615.0
    effort58.7560.2212.0780.082.27423.21860.067.8167.46620.080.2920.45735.0
    frustration34.37545.88111.19425.032.2821.8155.035.7151.9550.038.0592.9855.0
SASHA Q Average [0, 5]3.5783.440.12220.03.3720.1835.03.4040.16530.03.8070.20825.0
SASHA Q Average %0.7160.6880.02420.00.6740.03635.00.6810.03330.00.7610.04225.0
    manageable4.6253.8140.69330.03.1691.39965.03.3281.21355.03.7120.77725.0
    next steps4.53.8390.56625.03.6960.7740.03.4231.0750.04.2580.1515.0
    heavy focus2.3754.081.63465.02.8410.45525.03.1060.69935.02.1080.1475.0
    find info3.01.4571.43960.00.6782.06685.01.8171.14555.01.7641.2140.0
    valuable info3.3753.8320.44120.04.0250.55330.03.3980.0970.03.8720.44615.0
    attention3.6253.4880.1095.04.4310.75440.03.5410.0970.05.8612.22765.0
    understanding3.53.8340.31515.04.2950.72340.04.1770.67335.05.1951.53250.0
    awareness3.6252.8330.78635.02.9160.64935.02.7590.80840.03.7390.1470.0
Table A4. Continuation of Table A3.
KPIT1 Obs.ATC 5ATC 6ATC 7ATC 8
Mean T2 Pred. ± Conf. % T2 Pred. ± Conf. % T2 Pred. ± Conf. % T2 Pred. ± Conf. %
NASA TLX Average [0, 100]45.72948.6282.4365.048.4332.1555.060.5813.19525.055.6118.77120.0
NASA TLX Average %0.5430.5650.025<1.00.6060.04410.00.4040.13525.00.5360.022<1.0
    mental61.56264.3092.7165.064.2252.4025.061.9122.8920.066.1162.4195.0
    physical27.533.0043.0435.056.37925.77845.061.74931.02545.057.47229.31950.0
    temporal43.43747.0432.8045.061.22615.27330.078.64832.30250.066.38920.95940.0
    performance48.7545.1452.7875.033.94912.54425.050.7112.9670.046.3712.4820.0
    effort58.7580.33319.33635.061.9342.3585.069.5238.56215.067.4077.16215.0
    frustration34.37536.062.8210.023.7210.09220.048.35912.14620.038.392.5135.0
SASHA Q Average [0, 5]3.5783.3950.15620.03.6070.034<1.03.2570.29935.03.7750.17525.0
SASHA Q Average %0.7160.6790.03120.00.7210.007<1.00.6510.0635.00.7550.03525.0
    manageable4.6253.4191.04735.04.020.51620.02.7411.86455.04.0460.5220.0
    next steps4.53.0971.36745.04.520.1260.02.9251.45545.04.3840.1270.0
    heavy focus2.3753.1790.70725.02.5870.1235.03.621.24240.02.8960.50120.0
    find info3.00.2032.54275.00.6412.24975.01.0071.96560.01.3591.47355.0
    valuable info3.3753.4770.140.03.8630.37315.03.5310.1495.04.1730.76630.0
    attention3.6253.1980.42115.04.1030.37315.04.0160.29810.04.7931.04440.0
    understanding3.53.870.26810.03.9060.35715.03.8030.28610.04.2970.73430.0
    awareness3.6252.8380.70725.03.4190.1235.01.9911.60150.03.3330.24810.0

References

  1. European Commission. Reducing Emissions from Aviation. Available online: https://climate.ec.europa.eu/eu-action/transport/reducing-emissions-aviation_en (accessed on 8 April 2024).
  2. EUROCONTROL. Aviation Outlook 2050: Air Traffic Forecast Shows Aviation Pathway To Net Zero CO2 Emissions. 2022. Available online: https://www.eurocontrol.int/article/aviation-outlook-2050-air-traffic-forecast-shows-aviation-pathway-net-zero-co2-emissions (accessed on 8 April 2024).
  3. Perott, A.; Schader, N.T.; Leonhardt, J.; Licu, T. Human Factors Integration in ATM System Design. White paper, EUROCONTROL, 2019. [Google Scholar]
  4. ISO. Ergonomics of Human-System Interaction—Part 210: Human-Centred Design for Interactive Systems; International Organization for Standardization: Geneva, Switzerland, 2019. [Google Scholar]
  5. König, C.; Hofmann, T.; Bruder, R. Application of the user-centred design process according ISO 9241-210 in air traffic control. Work 2012, 41, 167–174. [Google Scholar] [CrossRef] [PubMed]
  6. Norman, D.A. The Design of Everyday Things; Basic Books: New York, NY, USA, 2002. [Google Scholar]
  7. Rubin, J.; Chisnell, D. Handbook of Usability Testing: How to Plan, Design, and Conduct Effective Tests, 2nd ed.; John Wiley & Sons: Hoboken, NJ, USA, 2008. [Google Scholar]
  8. Stanton, N.A.; Salmon, P.M.; Rafferty, L.A.; Walker, G.H.; Baber, C.; Jenkins, D.P. Human Factors Methods: A Practical Guide for Engineering and Design, 2nd ed.; CRC Press: London, UK, 2013. [Google Scholar]
  9. Tullis, T.; Albert, W. Measuring the User Experience: Collecting, Analyzing, and Presenting Usability Metrics, 2nd ed.; Morgan Kaufmann Publishers Inc.: San Francisco, CA, USA, 2013. [Google Scholar]
  10. Bach, C.; Scapin, D.L. Comparing inspections and user testing for the evaluation of virtual environments. Int. J. Hum.-Comput. Interact. 2010, 26, 786–824. [Google Scholar] [CrossRef]
  11. Nielsen, J. Usability Engineering, 1st ed.; Morgan Kaufmann Publishers Inc.: San Francisco, CA, USA, 1993. [Google Scholar]
  12. Dumas, J.S.; Redish, J. A Practical Guide to Usability Testing; Intellect Books: Bristol, UK, 1999. [Google Scholar]
  13. Dillon, A. The evaluation of software usability. In International Encyclopedia of Ergonomics and Human Factors; Karwowski, W., Ed.; Taylor & Francis: Hoboken, NJ, USA, 2001; pp. 1110–1112. [Google Scholar]
  14. Sagar, K.; Saha, A. A systematic review of software usability studies. Int. J. Inf. Technol. 2017, 1–24. [Google Scholar] [CrossRef]
  15. Bastien, C.J.M. Usability testing: A review of some methodological and technical aspects of the method. Int. J. Med. Inform. 2010, 79, e18–e23. [Google Scholar] [CrossRef] [PubMed]
  16. Bos, T.; Schuver-van Blanken, M.; Huisman, H. Towards a Paperless Air Traffic Control Tower. In Proceedings of the 2nd International Conference on Human Centered Design, Orlando, FL, USA, 9–14 July 2011; pp. 360–368. [Google Scholar] [CrossRef]
  17. Huber, S.; Gramlich, J.; Pauli, S.; Mundschenk, S.; Haugg, E.; Grundgeiger, T. Toward User Experience in ATC: Exploring Novel Interface Concepts for Air Traffic Control. Interact. Comput. 2022, 34, 43–59. [Google Scholar] [CrossRef]
  18. King, R.; Churchill, E.F.; Tan, C. Designing with Data: Improving the User Experience with A/B Testing, 1st ed.; O’Reilly Media, Inc.: Sebastopol, CA, USA, 2017. [Google Scholar]
  19. Kohavi, R.; Henne, R.M.; Sommerfield, D. Practical Guide to Controlled Experiments on the Web: Listen to Your Customers Not to the Hippo. In Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’07, San Jose, CA, USA, 12–15 August 2007; pp. 959–967. [Google Scholar] [CrossRef]
  20. Young, S. Improving Library User Experience with A/B Testing: Principles and Process. Weav. J. Libr. User Exp. 2014, 1. [Google Scholar] [CrossRef]
  21. Quin, F.; Weyns, D.; Galster, M.; Costa Silva, C. A/B testing: A systematic literature review. J. Syst. Softw. 2024, 211, 112011. [Google Scholar] [CrossRef]
  22. Hagar, N.; Diakopoulos, N. Optimizing Content with A/B Headline Testing: Changing Newsroom Practices. Media Commun. 2019, 7, 117. [Google Scholar] [CrossRef]
  23. Meta. Fundraising/2013-14 Report—Meta, Discussion about Wikimedia Projects. 2020. Available online: https://meta.wikimedia.org/wiki/Fundraising/2013-14_Report (accessed on 8 April 2024).
  24. MediaWiki. Page Previews/2016 A/B Tests—MediaWiki. 2022. Available online: https://www.mediawiki.org/wiki/Page_Previews/2016_A/B_Tests (accessed on 8 April 2024).
  25. MediaWiki. Page Previews/2017-18 A/B Tests — MediaWiki. 2020. Available online: https://www.mediawiki.org/wiki/Page_Previews/2017-18_A/B_Tests (accessed on 8 April 2024).
  26. Milanzi, E.; Njeru Njagi, E.; Bruckers, L.; Molenberghs, G. Data Representativeness: Issues and Solutions. EFSA Support. Publ. 2015, 12, 759E. [Google Scholar] [CrossRef]
  27. Bethlehem, J. Applied Survey Methods: A Statistical Perspective; John Wiley & Sons: Hoboken, NJ, USA, 2009. [Google Scholar]
  28. Parsons, V.L. Stratified Sampling. In Wiley StatsRef: Statistics Reference Online; John Wiley & Sons, Ltd.: Hoboken, NJ, USA, 2017; pp. 1–11. [Google Scholar] [CrossRef]
  29. Liberty, E.; Lang, K.; Shmakov, K. Stratified Sampling Meets Machine Learning. In Proceedings of the 33rd International Conference on Machine Learning, New York, NY, USA, 19–24 June 2016; Balcan, M.F., Weinberger, K.Q., Eds.; Volume 48, pp. 2320–2329. [Google Scholar]
  30. Raiffa, H.; Schlaifer, R. Applied Statistical Decision Theory; Harvard University: Boston, MA, USA, 1961. [Google Scholar]
  31. Ericson, W.A. Optimum Stratified Sampling Using Prior Information. J. Am. Stat. Assoc. 1965, 60, 750–771. [Google Scholar] [CrossRef]
  32. Hidiroglou, M.A.; Särndal, C.E. Use of auxiliary information for two-phase sampling. Surv. Methodol. 1998, 24, 11–20. [Google Scholar]
  33. Ahsan, M.J.; Khan, S. Optimum allocation in multivariate stratified random sampling with overhead cost. Metr. Int. J. Theor. Appl. Stat. 1982, 29, 71–78. [Google Scholar] [CrossRef]
  34. Khan, M.G.; Maiti, T.; Ahsan, M.J. An Optimal Multivariate Stratified Sampling Design Using Auxiliary Information: An Integer Solution Using Goal Programming Approach. J. Off. Stat. 2010, 26. [Google Scholar]
  35. Varshney, R.; Siddiqui, N.; Ahsan, M.J. Estimation of more than one parameters in stratified sampling with fixed budget. Math. Methods Oper. Res. 2012, 75. [Google Scholar] [CrossRef]
Figure 1. The new approach presented here replaces some expert tests (no. 2, 3, 4, and 6, 7, 8; shown in grey) with non-expert tests (shown in green). Although the wrong target group is used, the results can be converted to the results of the expert tests (indicated by #) through statistical transformations and corrections. If some tests are replaced in this way, and if non-experts are cheaper and more readily available, this approach can both reduce costs and increase the number of tests.
Figure 2. The core of the transformation model is a mathematical representation of each expert (more precisely, of the expert's KPIs) as a weighted sum of non-experts (of their KPIs).
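The weighted-sum representation in Figure 2 amounts to an ordinary least-squares fit of one expert's result vector against the non-experts' result vectors. The sketch below illustrates this with synthetic stand-in data; the dimensions (117 values per participant, 19 non-experts) follow Tables 2 and 3, while the data themselves are random placeholders, not study data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in data: 117 test values per participant (cf. Table 2)
# for 19 non-expert students (columns of X) and one expert (y).
n_values, n_students = 117, 19
X = rng.normal(size=(n_values, n_students))
y = X @ rng.uniform(-0.3, 0.5, size=n_students) + rng.normal(scale=0.1, size=n_values)

# Represent the expert as an intercept plus a weighted sum of students,
# y ~ eps_hat + sum_j beta_hat_j * x_j, via least squares (as in Table 3).
A = np.column_stack([np.ones(n_values), X])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
eps_hat, beta_hat = coef[0], coef[1:]
prediction = A @ coef
```

With an intercept in the model, the fitted values reproduce the mean of y exactly, and the 19 weights play the role of the β̂ columns in Table 3.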
Figure 3. This matrix shows the Pearson correlation coefficients between the dependent and independent variables.
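A correlation matrix like the one in Figure 3 can be computed with a single call. The sketch below uses random placeholder data with hypothetical variable counts, since the study data are not part of this section.

```python
import numpy as np

rng = np.random.default_rng(1)

# Placeholder data: rows are participants, columns are variables
# (e.g. dependent KPIs and independent questionnaire scores).
dependent = rng.normal(size=(30, 4))
independent = rng.normal(size=(30, 5))

# Pearson correlation matrix over all variable pairs, as in Figure 3;
# np.corrcoef treats rows as variables, hence the transpose.
corr = np.corrcoef(np.hstack([dependent, independent]).T)
```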
Table 1. These key performance indicators were used to assess the performance of participants within the test scenarios and were further integrated into the transformation model to establish a mapping between ATCs and students.
# | KPI | Description
#1 | Taken over (#) | Number of planes taken over by the test subject
#2 | Taken over (%) | Percentage of the optimal number of taken-over planes
#3 | Time until takeover total (mm:ss) | Duration from the radio message from the aircraft to acceptance by the test subject, summed across all planes
#4 | Time until takeover/plane (mm:ss) | Duration from the radio message from the aircraft to acceptance by the test subject, per plane
#5 | Landings 1 (#) | Number of planes landed by the test subject
#6 | Landings 2 (#) | Number of non-landed planes already in position to land, with distance to the runway < 10 km and height < 1000 ft
#7 | Landings 3 (#) | Number of non-landed planes already in position to land, with distance to the runway < 10 km and height < 5000 ft
#8 | Calculated Landings (#) | Number of planes landed by the test subject plus planes close to landing (Landings 2 and Landings 3); calculated as Landings 1 + 1/2 · Landings 2 + 1/4 · Landings 3
#9 | Optimum Landings (%) | Percentage of the optimum of landed planes
#10 | Calculated Optimum Landings (%) | Percentage of the optimum of calculated landings
#11 | Time deviation to landing total (mm:ss) | Total deviation from the simulated landing times of the ATS
#12 | Time deviation to landing/plane (mm:ss) | Deviation per plane from the simulated landing time of the ATS
#13 | Distance deviation to landing total (km) | Total deviation from the simulated routes of the ATS
#14 | Distance deviation to landing/plane (km) | Deviation per plane from the simulated route of the ATS
#15 | Height not landed total (ft) | Total height of the non-landed planes
#16 | Height not landed/plane (ft) | Average height per plane of the non-landed planes
#17 | Distance not landed total (km) | Total distance of the non-landed planes to the runway
#18 | Distance not landed/plane (km) | Average distance per plane of the non-landed planes to the runway
#19 | Distance not landed/plane (%) | Average distance per plane of the non-landed planes to the runway, in relation to the ATS simulation
#20 | Conflicts (#) | Number of losses of separation
#21 | Instructions/plane (#) | Number of instructions given by the test subject per plane
#22 | Instructions total (#) | Total number of instructions given by the test subject
#23 | NASA TLX Average ([0, 100]) | Average of the NASA TLX results
#24 | NASA TLX Average (%) | Percentage of the optimal NASA TLX score
#25 | SASHA_Q Average ([1, 5]) | Average of the SASHA_Q results
#26 | SASHA_Q Average (%) | Percentage of the optimal SASHA_Q score
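KPI #8 combines full and partial landings into a single score. As a quick check of the formula in Table 1, the observed expert averages from Table 5 (Landings 1 = 4.000, Landings 2 = 0.500, Landings 3 = 1.625) reproduce the tabulated Calculated Landings value of 4.656:

```python
def calculated_landings(landings1: float, landings2: float, landings3: float) -> float:
    """Calculated Landings (KPI #8): landed planes plus partial credit
    for planes close to landing, per the formula in Table 1."""
    return landings1 + 0.5 * landings2 + 0.25 * landings3

value = calculated_landings(4.000, 0.500, 1.625)  # → 4.65625, i.e. 4.656 rounded
```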
Table 2. Each test participant and the corresponding test results consist of 117 values. This table shows how they are allocated to the different components of the test.
Component | Number of Values
KPIs | 26
NASA TLX questionnaire | 6
SASHA_Q questionnaire | 8
Psychological questionnaire (auxiliary information) | 77
Table 3. The least-squares estimates ε̂, β̂_1, …, β̂_19 define the multiple linear regression (MLR) models that represent the results of experts in terms of the results of non-experts.
Model | ATC 1 | ATC 2 | ATC 3 | ATC 4 | ATC 5 | ATC 6 | ATC 7 | ATC 8
ε̂ | 0.1354 | 0.1043 | 0.1536 | 0.2074 | 0.3021 | 0.2366 | 0.1571 | 0.1873
β̂_1 | −0.2409 | −0.0143 | −0.1072 | −0.1475 | −0.0673 | 0.0317 | −0.0244 | −0.0591
β̂_2 | 0.2460 | 0.1327 | 0.3101 | −0.1876 | −0.1702 | 0.1251 | 0.1803 | −0.0831
β̂_3 | 0.3458 | −0.0361 | 0.0436 | −0.0904 | −0.0125 | 0.0149 | 0.1451 | 0.0362
β̂_4 | 0.0775 | −0.1603 | 0.0008 | 0.2113 | −0.0302 | −0.2141 | −0.0226 | 0.0626
β̂_5 | −0.0026 | 0.0228 | −0.0076 | −0.0375 | −0.0083 | 0.0658 | 0.0059 | 0.1014
β̂_6 | 0.1586 | 0.0884 | 0.1352 | 0.1280 | 0.0482 | 0.2448 | 0.0550 | 0.0855
β̂_7 | 0.0254 | 0.0143 | −0.1737 | −0.2004 | −0.1000 | 0.0210 | −0.1031 | 0.0233
β̂_8 | −0.0843 | −0.0230 | −0.1020 | 0.2365 | −0.1196 | −0.0099 | −0.1451 | 0.1066
β̂_9 | −0.1471 | −0.0164 | 0.0134 | 0.0755 | 0.1821 | −0.0203 | −0.0790 | −0.0756
β̂_10 | −0.3673 | −0.2102 | −0.2407 | −0.2007 | −0.4335 | −0.3134 | −0.3361 | −0.2876
β̂_11 | −0.0477 | −0.0983 | 0.0192 | −0.1371 | −0.0023 | −0.0976 | −0.1815 | −0.1058
β̂_12 | 0.1427 | 0.0039 | 0.1100 | 0.1477 | −0.0613 | 0.2395 | 0.1955 | 0.2445
β̂_13 | 0.0050 | 0.1125 | −0.0561 | −0.0608 | 0.2664 | −0.0121 | −0.0247 | −0.0120
β̂_14 | 0.2883 | 0.3799 | 0.2121 | 0.0107 | 0.3367 | 0.1448 | 0.4769 | 0.0188
β̂_15 | 0.1091 | −0.0761 | 0.0741 | −0.0498 | 0.1446 | −0.1154 | −0.1304 | 0.0282
β̂_16 | −0.1307 | 0.0401 | 0.0447 | 0.2868 | −0.0672 | 0.0309 | −0.0749 | 0.1060
β̂_17 | 0.2900 | 0.3036 | 0.3321 | 0.4234 | 0.2471 | 0.4114 | 0.4032 | 0.3903
β̂_18 | −0.0467 | 0.0720 | −0.1378 | 0.2000 | 0.2054 | −0.0058 | 0.1979 | 0.0681
β̂_19 | 0.1927 | 0.2922 | 0.2627 | 0.1321 | 0.0921 | 0.0819 | 0.1585 | 0.0749
min | −0.3673 | −0.2102 | −0.2407 | −0.2007 | −0.4335 | −0.3134 | −0.3361 | −0.2876
max | 0.3458 | 0.3799 | 0.3321 | 0.4234 | 0.3367 | 0.4114 | 0.4769 | 0.3903
mean | 0.0428 | 0.0435 | 0.0385 | 0.0389 | 0.0236 | 0.0327 | 0.0366 | 0.0380
std.-dev. | 0.1859 | 0.1485 | 0.1564 | 0.1787 | 0.1772 | 0.1596 | 0.1962 | 0.1361
variance | 0.0345 | 0.0220 | 0.0244 | 0.0319 | 0.0314 | 0.0254 | 0.0385 | 0.0185
Table 4. The coefficients of determination R² and R²_adj can be interpreted as the proportion of variance in the data that is explained by the regression model. The adjusted R²_adj takes the model size into account; the unadjusted coefficient of determination R² increases automatically when additional variables are added to the model.
Coefficient of Determination | ATC 1 | ATC 2 | ATC 3 | ATC 4 | ATC 5 | ATC 6 | ATC 7 | ATC 8
R² | 0.6136 | 0.6555 | 0.5458 | 0.4499 | 0.5154 | 0.5124 | 0.5397 | 0.5546
R²_adj | 0.5379 | 0.5880 | 0.4569 | 0.3421 | 0.4205 | 0.4169 | 0.4495 | 0.4673
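The relation between R² and R²_adj is the standard adjustment for model size. Assuming each regression uses n = 117 observations (the values per participant in Table 2) and p = 19 predictors (the students of Table 3), an assumption that is consistent with every column of Table 4, the tabulated pairs are reproduced:

```python
def adjusted_r2(r2: float, n: int, p: int) -> float:
    """Adjusted coefficient of determination: penalises R² for the
    number of predictors p, given n observations."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)

r2_adj_atc1 = adjusted_r2(0.6136, n=117, p=19)  # ≈ 0.5379 (Table 4, ATC 1)
r2_adj_atc2 = adjusted_r2(0.6555, n=117, p=19)  # ≈ 0.5880 (Table 4, ATC 2)
```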
Table 5. The transformation model uses the non-expert (student) results to predict the expert (ATC) results. Compared to the real test results of the experts in Task 2, the transformation model achieves a relative error of less than 1% in 1 out of 26 KPIs, a relative error between 1% and 5% in 9 out of 26 KPIs, a relative error between 5% and 10% in 5 out of 26 KPIs, and a relative error greater than 10% in 11 out of 26 KPIs. The main concept of the transformation model is based on auxiliary information. To illustrate its power, the transformation results based on a linear model without auxiliary information are included as well.
KPI | Observation | Prediction (without aux. info.) | Error | Prediction (with aux. info.) | Error | Improvement
Taken over (#) | 9.750 | 10.597 | 14.2% | 9.565 | 3.1% | +11.1%
Taken over (%) | 0.886 | 0.963 | 14.2% | 0.870 | 3.1% | +11.1%
Time until takeover total (mm:ss) | 172.625 | −275.687 | 19.9% | 521.359 | 15.5% | +4.4%
Time until takeover/plane (mm:ss) | 17.750 | −39.838 | 21.4% | 59.750 | 15.6% | +5.8%
Landings 1 (#) | 4.000 | 5.573 | 32.5% | 3.874 | 2.6% | +29.8%
Landings 2 (#) | 0.500 | 1.648 | 95.8% | 0.796 | 24.7% | +71.1%
Landings 3 (#) | 1.625 | 2.114 | 13.6% | 1.232 | 10.9% | +2.7%
Calculated Landings (#) | 4.656 | 6.830 | 37.7% | 4.483 | 3.0% | +34.7%
Optimum Landings (%) | 0.80 | 1.115 | 32.4% | 0.775 | 2.6% | +29.8%
Calculated Optimum Landings (%) | 0.776 | 1.155 | 34.6% | 0.764 | 1.1% | +33.5%
Time deviation to landing total (mm:ss) | −73.875 | −17.069 | 6.3% | 14.316 | 9.8% | −3.5%
Time deviation to landing/plane (mm:ss) | −10.250 | 15.990 | 6.9% | 2.711 | 3.4% | +3.5%
Distance deviation to landing total (km) | 4.929 | 4.545 | 0.3% | 8.494 | 3.0% | −2.7%
Distance deviation to landing/plane (km) | 2.392 | 0.857 | 4.0% | 2.122 | 0.7% | +3.3%
Height not landed total (ft) | 46,901.500 | 57,265.587 | 25.6% | 50,987.712 | 10.1% | +15.5%
Height not landed/plane (ft) | 6671.031 | 9012.775 | 52.6% | 7111.841 | 9.9% | +42.7%
Distance not landed total (km) | 132.568 | 156.490 | 10.3% | 171.081 | 16.6% | −6.3%
Distance not landed/plane (km) | 18.904 | 25.167 | 29.7% | 23.018 | 19.5% | +10.2%
Distance not landed/plane (%) | 0.868 | 0.761 | 11.9% | 0.835 | 3.7% | +8.2%
Conflicts (#) | 0.375 | −2.346 | 28.4% | 1.315 | 9.8% | +18.6%
Instructions/plane (#) | 5.924 | 4.149 | 28.4% | 4.460 | 23.4% | +5.0%
Instructions total (#) | 57.250 | 46.359 | 20.6% | 42.990 | 27.0% | −6.4%
NASA TLX Average ([0, 100]) | 37.396 | 34.796 | 2.8% | 54.647 | 18.5% | −15.7%
NASA TLX Average (%) | 0.626 | 0.723 | 2.8% | 0.521 | 11.2% | −15.7%
    mental | 56.562 | 20.909 | 35.8% | 65.717 | 9.2% | +26.6%
    physical | 31.250 | 85.875 | 54.6% | 51.462 | 20.2% | +34.4%
    temporal | 37.188 | 37.900 | 0.7% | 66.266 | 29.1% | −28.4%
    performance | 27.500 | 41.185 | 13.7% | 44.414 | 16.9% | −3.2%
    effort | 47.188 | 29.451 | 17.7% | 71.225 | 24.0% | −6.3%
    frustration | 24.688 | 1.848 | 22.8% | 37.308 | 12.6% | +10.2%
SASHA_Q Average ([0, 5]) | 3.438 | 4.012 | 43.0% | 3.507 | 5.2% | +37.8%
SASHA_Q Average (%) | 0.688 | 0.802 | 43.0% | 0.701 | 5.2% | +37.8%
    manageable | 4.750 | 6.664 | 38.3% | 3.531 | 24.4% | +13.9%
    next steps | 4.625 | 6.562 | 38.8% | 3.768 | 17.2% | +21.6%
    heavy focus | 2.125 | 2.174 | 1.0% | 3.052 | 18.5% | −17.5%
    find info | 2.500 | −1.420 | 78.4% | 1.116 | 27.7% | +50.8%
    valuable info | 3.375 | 5.508 | 42.5% | 3.771 | 7.9% | +34.6%
    attention | 3.000 | 3.940 | 18.8% | 4.179 | 23.6% | −4.8%
    understanding | 3.500 | 3.431 | 1.3% | 4.172 | 13.4% | −12.0%
    awareness | 3.625 | 3.866 | 4.8% | 2.979 | 12.9% | −8.1%
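The error-band counts stated in the Table 5 caption can be verified directly from the with-auxiliary-information error column. The sketch below transcribes the 26 KPI error percentages from the table and assumes half-open band boundaries; note that each Improvement entry is simply the without-aux error minus the with-aux error in percentage points (e.g. 14.2% − 3.1% = +11.1% for "Taken over").

```python
# Relative errors (%) of the transformation model with auxiliary
# information for the 26 KPIs of Table 5, in table order.
errors_with_aux = [3.1, 3.1, 15.5, 15.6, 2.6, 24.7, 10.9, 3.0, 2.6, 1.1,
                   9.8, 3.4, 3.0, 0.7, 10.1, 9.9, 16.6, 19.5, 3.7, 9.8,
                   23.4, 27.0, 18.5, 11.2, 5.2, 5.2]

def count_in_bands(errors):
    """Count KPIs per error band, as reported in the Table 5 caption."""
    bands = {"<1%": 0, "1-5%": 0, "5-10%": 0, ">10%": 0}
    for e in errors:
        if e < 1:
            bands["<1%"] += 1
        elif e < 5:
            bands["1-5%"] += 1
        elif e < 10:
            bands["5-10%"] += 1
        else:
            bands[">10%"] += 1
    return bands

counts = count_in_bands(errors_with_aux)
# counts == {'<1%': 1, '1-5%': 9, '5-10%': 5, '>10%': 11}
```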

Share and Cite

MDPI and ACS Style

Stranger, P.; Judmaier, P.; Rottermanner, G.; Rokitansky, C.-H.; Szilagyi, I.-S.; Settgast, V.; Ullrich, T. A Novel Approach Using Non-Experts and Transformation Models to Predict the Performance of Experts in A/B Tests. Aerospace 2024, 11, 574. https://doi.org/10.3390/aerospace11070574
