1. Introduction
Engineering robot personalities for practical applications entails several idiosyncratic challenges. First and foremost, it means engineering robot personalities in significant quantities, as opposed to the one or two robots typical of experiments. How large ‘significant’ is depends on the application. For example, a music band of robots might require three to ten robot personalities, so that each of them appears to the audience as an individual character, as human musicians do. A robotic staff for a business establishment might require tens or hundreds, so that their human colleagues can recognise them as individuals, as they do other humans. As for domestic service robots, it could put their human users in a more comfortable and constructive frame of mind if their robots are unique presences that care specially for them: different from the same models of robots working for their relatives or friends, and different from the identical units mowing the lawns of strangers. Generally speaking, it depends on how many robots will be observed together as a population associated with a quality, goal, purpose, or workplace. Adding to this challenge, robots, if useful enough, will be mass-produced. While the hardware of robots can be mass-produced by duplicating the same design, personalities cannot be copied by definition [
1]. Based on the current understanding of human personalities [
2], we define a
robot personality as a robot exhibiting characteristic patterns of computation and behaviour with inter- and intra-individual differences. The differences manifest as personality traits qualified on personality dimensions and quantified as measurements of their strength on the corresponding dimensions. As with a human personality, a robot’s personality individuates the robot as a unique presence, different from the rest of its kind. Therefore, to engineer a robot personality is to engineer its individuality, which is significantly harder to do in large quantities. To engineer mass-produced, physically identical robots into significant quantities of desirable personalities is hence one of the main challenges of robot personalities engineering for practical applications.
To engineer robot personalities in significant quantities, we consider the following optimisation process. We first design some unique archetypes, i.e., desirable target personalities on which the robot personalities will be based, and then proceed to minimise the distances between the robot personalities under construction and their corresponding archetypes in a coordinate system called a personality space, defined by some personality dimensions. When the distances are small enough, the goal is achieved. (Optimisation methods are not the focus of this work.) The question is: what are those personality dimensions?
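As a minimal illustration of this framing (not part of the procedure itself; the trait values and the Euclidean metric are assumptions made for the example), the distance between a robot personality under construction and its archetype in a personality space might be computed as follows:

```python
import math

def personality_distance(robot_traits, archetype_traits):
    """Euclidean distance between two points in a personality space,
    each given as trait scores on the same personality dimensions."""
    if len(robot_traits) != len(archetype_traits):
        raise ValueError("trait vectors must share the same dimensions")
    return math.sqrt(sum((r - a) ** 2
                         for r, a in zip(robot_traits, archetype_traits)))

# A hypothetical five-dimensional personality space (e.g., scores on 1-5):
archetype = [4.0, 3.5, 2.0, 1.5, 4.5]
candidate = [3.5, 3.0, 2.5, 1.5, 4.0]
print(personality_distance(candidate, archetype))  # smaller is better
```

Minimising this quantity over all robot personalities and their archetypes is what ‘small enough distances’ refers to above.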
The answers should come from the users of the robots, since they are the main observers, judges, and beneficiaries of the robots’ personalities. Most users are humans. Humans exhibit a tendency to attribute human qualities to non-human entities. For this reason, human personality dimensions, especially those of trait-based models, are frequently used as the bases for synthesising artificial personalities [
3]. Trait-based personality models are often formulated as a set of personality dimensions on which an individual’s traits can be measured, such as the five-factor model, first discovered by Tupes and Christal in 1961 (a reprint of their work is available [
4]), and subsequently by Norman in 1963 [
5]. The arguably most popular variant has these five dimensions [
6]: extraversion, agreeableness, conscientiousness, neuroticism, and openness—known as the ‘Big Five’ [
7]. However, there are several issues in applying human personality models to engineering robot personalities. The first issue is the most critical: the applicability of personality dimensions. Human personality models are often empirical formulations based on a lexicon formed over time for describing humans, as explained by the lexical hypothesis [
7,
8]. Lexicons for describing humans do not necessarily apply to describing artificial agents [
9]. Applying a human personality model to robots that were not created to resemble human presences (i.e., androids) violates the lexical hypothesis. The implication is that not all personality dimensions of a model necessarily apply to a type of robot, especially when its appearance is far from a human image. Another issue is that, although objectively of equal importance, not all personality traits are equally prominent in people’s eyes. A character is often recognised by their most memorable traits, as predicted by the ‘availability bias’ [
10], and there can be ‘central traits’ that dominate the overall impression of an individual [
11]. How an individual is perceived is often affected by cognitive biases, which are something to consider or even leverage in robot personalities engineering. Last but not least, whether optimisation is possible on a personality dimension depends on how some specific users consider the personalities of a type of robot, especially whether they can provide effective feedback to guide the optimisation of the traits on that dimension. The dimensions may vary from user group to user group, since not all people necessarily consider a type of trait relevant to a type of robot that is not an android. To summarise, the first issue implies that we cannot be sure whether a human personality dimension applies to a robot that is not an android. The second issue implies that we need to focus on the traits that matter most to users. The third issue implies that the specific users of certain robots have the ultimate say on what traits apply and matter to their robots; such traits are those on which they can provide effective feedback. Therefore, we need a test procedure to identify personality dimensions on which some users can provide effective feedback about the personality traits of their robots, as an engineering tool for engineering robot personalities out of a type of robot knowing its typical usage.
The main contribution of this work is such a test procedure. It applies to robots that can imitate human behaviour and small user groups with at least eight people. It is an engineering tool to identify the dimensions on which users can provide effective feedback to guide the engineering of robot personalities. At the beginning of the engineering work, the identified personality dimensions can serve as recommendations for some aspects of the robot personalities to focus on, and during the optimisation of robot personalities, they can constitute a coordinate system where the quality of the robot personalities under construction can be measured with the corresponding archetypes as reference points. As far as the recent surveys [
3,
12,
13] can tell, we are the first to propose such a test procedure dedicated to engineering robot personalities in significant quantities for practical applications. To test the proposed test procedure, we conducted a series of tests simulating engineering tasks where 10 robot personalities were to be engineered out of 10 personality archetypes for small user groups of 3 to 18 people using the dimensions of the five-factor model [
6]. The type of robot, a life-size humanoid ‘barebones’ robot, engaged users in dyadic communication, and its main modes of personality expression were head and eye movements. We confirmed the effectiveness of the proposed test procedure within the scope of the tests. The results show that the proposed method worked for user groups with at least eight people.
The rest of this article is organised as follows:
Section 2 elaborates on the research goal, why we need the test procedure, and examines the insufficiency of previous work.
Section 3 presents the test procedure.
Section 4 relates the experiment conducted to test the test procedure.
Section 5 shows the results of the experiment.
Section 6 discusses the results and limitations, and
Section 7 concludes this work.
2. Goals and the Limitations of the State of the Art in the Engineering Context
The test procedure should tell us, with the help of a user group, which of the personality dimensions of a personality model are suitable for engineering robot personalities out of a type of robot knowing its typical usage. It should meet three requirements:
It identifies personality dimensions on which a group of users can provide effective feedback to guide the optimisation of the personality traits of the robot personalities under construction (e.g., by minimising the distances between the robot personalities under construction and the corresponding archetypes);
It supports engineering significant quantities of robot personalities;
It works with small user groups.
Many previous studies have more or less done similar work in exploring robots’ potential for expressing personality traits or studying the effects or properties of robot personalities [
14,
15,
16,
17,
18,
19,
20,
21,
22,
23]. They offer valuable scientific insights into the roles of robot personality in human–robot interaction. However, their ‘tests’ were unsuitable for engineering tasks for they fell short of at least one of the above requirements.
The first and most common limitation is to consider the types of personalities rather than personalities with individual differences, which we call the ‘binary trait’ simplification. Vinciarelli and Mohammadi have referred to splitting personalities into two classes (per dimension) as
‘binary classification approaches’ and in their extensive survey commented that binary classes are
‘not meaningful from a psychological point of view’ [
3]. Their survey revealed that ‘binary classification approaches’ were prevalent in the field. We would argue that the simplification is not meaningful from an engineering point of view either; in fact, it defeats the purpose entirely. To engineer robot personalities in significant quantities, the ‘binary trait’ simplification must not be applied, since we need to engineer far more than two robot personalities per dimension, with each of them manifesting individual differences in characteristics as their traits. The proposed test procedure does not require the ‘binary trait’ simplification and hence is not subject to this limitation.
If considered in the personalities engineering context, the second limitation of previous ‘tests’ is the requirement of a large sample of observers. Most if not all previous studies involved more than 20 observers, which is appropriate for studying effects or properties of robot personalities, where large representative samples are desirable. However, in engineering robot personalities for practical applications, most user groups will be small. Potential household user groups will mostly consist of two to nine people, extrapolating from the UN’s data in 2017 on household sizes [
24]. As for small businesses, as of the time of this work, in the United States, currently the largest economy, the average number of employees is about 10, and for small businesses that have employees, the numbers of employees range from 1 to 19 [
25]. As of Japan in 2019, small enterprises with fewer than 20 employees accounted for 85 percent of all enterprises, and those in service, retail, and wholesale industries had up to five employees [
26]. Whether we consider potential household users or enterprise users, small user groups with fewer than 20 people will be the most common. The test results from one (large) sample of subjects do not necessarily apply to other (smaller) user groups, because how people perceive personalities depends on their own personalities and a number of other factors [
27,
28,
29,
30,
31]. The results can be highly user-dependent and contextual, meaning different user groups may perceive the same robot personalities differently, and the same user group may perceive the same robot personalities differently under different circumstances. The same behaviour may indicate different traits in different minds in different contexts. Something as straightforward as eye contact can point to different traits in different cultures: in many eastern cultures, staring into the other’s eyes is confrontational or arrogant; in many western cultures, not doing so is disrespectful, showing disinterest or guilt. Cultural differences are not the only factor; interpersonal differences should also be taken into account. Between lovers, staring into the other’s eyes can be a cue of strong affection, in eastern and western cultures alike; between rivals, it expresses animosity, strength, or resolution, among other possibilities. Due to the complexity of personality and how it is perceived, it is almost certain that a group of users will not perceive the robot personalities under construction as similar to the archetypes without optimisation, even when the robots exhibit the most ‘archetypal’ behaviour. This ‘self-other’ discrepancy has been observed in perceiving human personalities for a long time [
32], and in the perception of robot personalities as well [
33]. The discrepancy should be minimised for the sake of consistent user experience. What dimensions and how much distance to minimise depend on the users. It follows that we should optimise robot personalities on a case-by-case basis.
The third and most fatal limitation is missing the step of checking the applicability of human personality dimensions to robots that are not androids. We can assume that human personality dimensions apply to androids, since androids are created to resemble humans and hence do. For other robots, we cannot be sure whether the items of a dimension describe the robot (e.g., whether an ‘open-minded’ robot vacuum cleaner makes sense to certain users). Consequently, this limitation would apply to nearly all previous testing methods if they were applied to personalities engineering, unless the robots to engineer were androids, since the results acquired would be in violation of the lexical hypothesis [
7,
8]. The proposed method is itself a guard against descriptions that do not make sense regarding the type of robot to engineer.
3. Methods: Proposed Test Procedure
The test procedure tests hypotheses in the following format:
Hypothesis (Format). Given a type of robot R exhibiting typical behaviour B in situation S and a personality dimension P of a personality model M, P is a personality dimension on which fidelity can be effectively optimised.
Here, fidelity refers to the proximity between the robot personalities under construction and their corresponding archetypes on the personality dimensions of the model: the higher the fidelity/proximity, the smaller the distances. The exact definition of fidelity should depend on the personality model to investigate. Generally speaking, the fidelity of a trait of a robot personality under construction is the accretion of distance measurements, on the corresponding personality dimension, between the archetype after which the robot personality is engineered and a number of observations of the trait from a group of observers. Since a robot personality has to be observed (unlike humans, robots cannot report their own personalities as their ‘true’ personalities; what serve as their supposed ‘true’ personalities are their corresponding archetypes, which do not necessarily match their observed personalities without optimisation), the observers’ own personalities and backgrounds affect their observations; hence, a measurement of fidelity is always associated with a particular group of observers, and there can be no absolutely objective fidelity measurement.
Testing a hypothesis in the format requires testing three corresponding sub-hypotheses:
Sub-Hypothesis 1. The fidelity, computed as the proximity between the human observations and the corresponding archetypes, is statistically distinguishable from that by random guesses.
Sub-Hypothesis 2. There is a significant difference between the consistency of the observations on the robot personality and that of those on a human personality in the same settings.
Here, consistency refers to the negative dispersion among the impressions of a significant number of observers on a robot personality: the more dispersed the impressions are, the lower the consistency. Usually, consistency can be measured as negative variance.
Sub-Hypothesis 3. The pseudo-fidelity, computed as the proximity between the human observations and the corresponding observers’ own personalities, is statistically distinguishable from that by random guesses.
To identify a dimension as engineering-worthy, we should reject only the null hypothesis of Sub-Hypothesis 1 (Case 5 in
Table 1). The rationale behind this is explained by the following three working hypotheses:
Working Hypothesis 1. When some human observers use a personality model designed based on a lexicon for describing humans or animals to assess a robot personality engineered after an archetype, the corresponding fidelity is not necessarily statistically distinguishable from that by random guesses, which would imply that they have completed the assessment by guesswork.
If the observers are just guessing the traits, their observations cannot be used to guide the optimisation of fidelity. However:
Working Hypothesis 2. If a robot is capable of imitating human behaviour in a given context where such behaviour is expected and the behaviour is typical of the robot in a usage that matches the context, some traits that apply to humans will also apply to the robot and are prominently observable in their typical behaviour, thereby resulting in fidelity that is statistically distinguishable from that by random guesses.
Even if the fidelity is non-random, we need to eliminate two other possible causes of non-randomness to make sure that it can guide personalities engineering: inconsistency and observers’ own personalities. Reports scattered on a personality dimension can still lead to significant differences from random guesses if they are dispersed enough. Highly dispersed reports reflect great inconsistency of opinions on the robot personalities. If the observers report the robot personalities as similar to their own, which can occur [
34], the corresponding fidelity will also be non-random while being irrelevant to the archetypes.
Working Hypothesis 3. If the fidelity from human observations that are as consistent as on a human personality in the same settings is statistically distinguishable from that by random guesses and the cause of it is not that the observers have reported the robot personalities to be similar to their own, the human observations can be used to guide the optimisation of the fidelity.
The third working hypothesis is supported by the following reasoning: when the robot’s behaviour is mapped to personality measurements as completed by some human observers and the mapping is not random but consistent, there exists an instance of behaviour leading to measurements that are closest to the corresponding archetype. By approximating that instance of behaviour, we can approximate the optimal fidelity. Assuming the robot’s behaviour is controlled by some parameters of a generative personality model, a personality model capable of generating behaviour with individual differences to reflect individual differences in personality, there should exist a set of parameters leading to the optimal behaviour. In that regard, common optimisation methods should apply to finding the parameters, such as gradient descent and genetic algorithms. However, whether they are efficient is another story.
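As a sketch of this reasoning, a gradient-free search such as the following might be used, assuming a toy stand-in for the ‘parameters → behaviour → observed traits’ mapping (the linear mapping and all names here are hypothetical; a real pipeline would render behaviour and collect user reports):

```python
import random

def observed_traits(params):
    """Toy stand-in for 'parameters -> behaviour -> observed traits'."""
    return [p * 0.9 + 0.1 for p in params]  # arbitrary fixed mapping

def distance(x, y):
    """Euclidean distance between two trait vectors."""
    return sum((a - b) ** 2 for a, b in zip(x, y)) ** 0.5

def random_search(archetype, n_params, iters=2000, seed=0):
    """Hill-climbing random search over the generative model's parameters."""
    rng = random.Random(seed)
    best = [rng.uniform(1, 5) for _ in range(n_params)]
    best_d = distance(observed_traits(best), archetype)
    for _ in range(iters):
        cand = [p + rng.gauss(0, 0.1) for p in best]
        d = distance(observed_traits(cand), archetype)
        if d < best_d:  # keep parameters that move the observed
            best, best_d = cand, d  # traits closer to the archetype
    return best, best_d

params, err = random_search(archetype=[4.0, 2.5, 3.0], n_params=3)
print(err)  # shrinks as the search proceeds
```

In practice, each evaluation of `observed_traits` would be far more expensive (behaviour generation plus human assessment), which is why the efficiency of such methods is, as noted above, another story.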
The proposed test procedure is based on the ‘robots-imitating-humans’ approach [
35,
36,
37] and existing statistical tools. The resources required are:
Fungible units of the type of robot to test;
Human archetypes who can serve as desirable examples for the type of robot in performing the tasks it is designed for;
Tools to capture the example behaviour as data;
Methods to enable the robot to imitate the example behaviour;
Actors that act as users;
The user group the robots are going to work for.
The procedure has four phases, as illustrated in
Figure 1 (where the arrows mark dependencies; the capsule is the starting point; the rectangles are processes; the cylinders are data sets or materials; and the hexagons are the results of processes). It is worth noting that it is unnecessary to perform the entire procedure from the beginning when more orders for the same model of robots for the same usage come from some other user groups; in this case, we can start with user assessment. In the following subsections, we will go through the four phases one by one in detail.
3.1. Phase 1: Recording Archetypal Behaviour
In the first phase, we first recruit some candidate archetypes, and then, in ‘behaviour recording sessions’, we acquire their personality measurements, record their behaviour, and let them report the personalities of the actors acting as users. How behaviour recording sessions should be carried out depends on the application. Generally speaking, a session is a simulation of the typical usage of the type of robot to develop, where the candidate archetypes serve as examples for the robot personalities. For example, if the type of robot is going to be office errand runners, we hire model (well-received) office workers to simulate an office environment; if it is going to be waiters in a restaurant, we hire model waiters and waitresses and simulate a restaurant; if it is going to be singers on the stage, we simulate a stage with real singers.
3.2. Phase 2: Implementing the Behaviour
In this phase, we produce the stimuli for user assessment in Phase 3. The stimuli can be video recordings of the simulation in Phase 1, or they can be the robot personalities themselves. For the latter, more than one unit of the type of robot may be required. How to produce the stimuli depends on the application. Generally speaking, the robots’ behaviour should be as close to the archetypes’ as possible. First, in ‘screening’, we need to exclude the candidates whose behaviour is beyond the robot’s capabilities or operational parameters. We use only the behavioural data from the selected candidates, who will be the human archetypes. Then, we process the data. How to do this depends on the recording method. In general, it means turning videos or motion capture data into a form that the robot can imitate. For example, in developing waiter robots, this can be processing a waiter’s gestures and body motion when ushering guests to their seats into joint tracking data. Next, in ‘extracting behaviour’, we further separate the processed data according to the mode (modality) of behaviour we are interested in investigating. For example, for developing waiter robots, we might be more interested in facial expressions and hand gestures than in gait as modes of personality expression. Finally, we program the robot with the behaviour to recreate the simulation in Phase 1, which is how we produce the stimuli for user assessment in Phase 3.
3.3. Phase 3: User Assessment
In Phase 3, we first make preparations for user assessment and then request the users to assess the robot personalities. The users either assess the robot personalities based on video stimuli or by interacting with the robot personalities themselves. For assessment based on video stimuli, we need to prepare only surveys. For assessment based on live interaction, we need to prepare for interactive settings as close to the simulations in Phase 1 as possible. After the surveys or interaction sessions, users also need to report their own personalities.
3.4. Phase 4: Three Tests
In Phase 4, we identify personality dimensions using three tests. We henceforth refer to users also as observers, since the users of the robots are the main observers of the corresponding robot personalities.
3.4.1. Data Sets Required
The tests require five data sets in total (
Table 2), four of which are from the previous phases: the archetypal personality self-reports, hereinafter denoted as $\mathbf{A}$, from Phase 2; the candidates’ (or archetypes’) reports on the actor’s personality (if the simulation in Phase 1 has involved multiple actors and reports have been acquired on all the actors, the reports on one of them are enough), hereinafter denoted as $\mathbf{C}$, from Phase 1; the users’ reports on the robot personalities, hereinafter denoted as
B, from Phase 3; and the personality self-reports by the users, hereinafter denoted as $\mathbf{U}$, from Phase 3. The fifth data set consists of random reports generated on demand, hereinafter denoted as
N. Here,
B and
N are not in bold because they are sets. In addition,
r denotes the number of robot personalities to engineer, which is the same as the number of archetypes;
u the number of users;
t the number of personality dimensions to test; and
c the number of reports in the data set $\mathbf{C}$.
$\mathbf{A}$ is an $r \times t$ matrix. It consists of r t-dimensional row vectors corresponding to the r archetypes, or t r-dimensional column vectors corresponding to the t personality dimensions. $\mathbf{C}$ is a $c \times t$ matrix. It consists of c t-dimensional row vectors corresponding to c sets of human observations on a human, or t c-dimensional column vectors corresponding to the t personality dimensions.
B is a set that consists of
u matrices: $B = \{\mathbf{B}_1, \dots, \mathbf{B}_u\}$, since each of the observers has reported
t traits on
r robot personalities; each matrix has the same dimensions as $\mathbf{A}$ and can be expressed likewise. $\mathbf{U}$ is a $u \times t$ matrix. It consists of u t-dimensional row vectors corresponding to the u observers (users), or t u-dimensional column vectors corresponding to the t personality dimensions.
N consists of randomly generated data per the dimensions required. To summarise, we have
$$\mathbf{A} = [a_{ij}], \quad \text{where } 1 \le i \le r,\ 1 \le j \le t,\ \text{and } a_{ij} \in \mathbb{R};$$
$$\mathbf{C} = [c_{ij}], \quad \text{where } 1 \le i \le c,\ 1 \le j \le t,\ \text{and } c_{ij} \in \mathbb{R};$$
$$B = \{\mathbf{B}_1, \dots, \mathbf{B}_u\}, \quad \text{where } \mathbf{B}_k \in \mathbb{R}^{r \times t};$$
$$\mathbf{U} = [u_{ij}], \quad \text{where } 1 \le i \le u,\ 1 \le j \le t,\ \text{and } u_{ij} \in \mathbb{R}.$$
N represents a ‘blind guesser’ who cannot perceive any personality traits and thus has no recourse but to guesswork when it is required to complete a personality assessment. It is what the observers are pitted against. Personality traits are often assessed with statements describing certain qualities of a subject, such as: ‘… is someone who likes to talk with friends’. The users must indicate the extent to which they agree or disagree with the statements. They are guessing if they have no idea how well the statements describe the robot personalities, or they can simply indicate that they neither agree nor disagree with the statements. A significantly large number of wild guesses should exhibit the same behaviour as random guesses generated by the uniform distribution.
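Such a blind guesser is straightforward to simulate (a sketch; the 44-item, 5-point Likert format is only an assumed example):

```python
import random
import statistics

def random_report(n_items=44, scale=(1, 5), rng=random):
    """One 'blind guesser' report: a uniform random answer to every item."""
    return [rng.randint(scale[0], scale[1]) for _ in range(n_items)]

rng = random.Random(42)
reports = [random_report(rng=rng) for _ in range(1000)]
item_means = [statistics.mean(r) for r in reports]

# With many wild guesses, the mean answer clusters around the scale
# midpoint (3 on a 1-5 scale), as expected of the uniform distribution.
print(round(statistics.mean(item_means), 2))
```

A large batch of such reports exhibits the behaviour described above: no trait information, only the statistics of the uniform distribution.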
3.4.2. Test 1: Robot Personalities’ Fidelity Test
The first test to run is the fidelity test. It tests Sub-Hypothesis 1 on each dimension.
It requires data sets $\mathbf{A}$,
B, and
N. Let
s denote the total number of reports: $s = ru$, since each of the u users reports on all r robot personalities.
The fidelity test compares the fidelity from the human observations with that from N to identify the personality dimensions.
We can compute fidelity as follows: given a trait as observed by
u users, which can be represented by a vector $\mathbf{b}$ ($\mathbf{b} \in \mathbb{R}^u$), and the corresponding trait of the archetype, $a$ ($a \in \mathbb{R}$), the fidelity vector $\mathbf{f}$ ($\mathbf{f} \in \mathbb{R}^u$) is expressed as
$$\mathbf{f} = \mathrm{abs}(\mathbf{b} - a\mathbf{1}),$$
where $\mathrm{abs}(\cdot)$ denotes a function that replaces all elements in a matrix or vector with their absolute values. (To avoid confusion, we refrain from using $|\cdot|$ since it also denotes the determinant of a matrix.)
Each element in the fidelity vector $\mathbf{f}$ is a numerical distance, which can also be called ‘a proximity value’. For instance, if Observer 5 has reported the extraversion level of a robot to be 3.5 when the robot personality is based on Archetype 3, whose extraversion level is 4, then the 5th element in the $\mathbf{f}$ of extraversion is 0.5. The fidelity of all robot personalities on all dimensions can be represented as an $ru \times t$ matrix $\mathbf{F}$ ($\mathbf{F} \in \mathbb{R}^{ru \times t}$), which can be computed from $\mathbf{A}$ (the archetypal personality self-reports) and
B (the users’ reports on the robot personalities) using
$$\mathbf{F} = \mathrm{abs}\left(\begin{bmatrix} \mathbf{B}_1 \\ \vdots \\ \mathbf{B}_u \end{bmatrix} - \begin{bmatrix} \mathbf{A} \\ \vdots \\ \mathbf{A} \end{bmatrix}\right).$$
We generate the same number of random reports per the assessment scales. A random report consists of random numbers generated as random answers to the questionnaires about the robot personalities. For example, given a questionnaire consisting of 44 questions with 5-point Likert scales enquiring how much an observer agrees with the corresponding statement, a human report would consist of 44 responses, whereas a random report would consist of 44 random integers on the range of $[1, 5]$. Random traits should be computed in the same way per the instructions of the personality assessment inventory, and the results are divided into
u matrices, which are denoted here as $\mathbf{N}_k$ ($1 \le k \le u$, $\mathbf{N}_k \in \mathbb{R}^{r \times t}$). We then compute the random ‘fidelity’ $\mathbf{F}_N$ ($\mathbf{F}_N \in \mathbb{R}^{ru \times t}$) using
$$\mathbf{F}_N = \mathrm{abs}\left(\begin{bmatrix} \mathbf{N}_1 \\ \vdots \\ \mathbf{N}_u \end{bmatrix} - \begin{bmatrix} \mathbf{A} \\ \vdots \\ \mathbf{A} \end{bmatrix}\right).$$
With $\mathbf{F}$ and $\mathbf{F}_N$ ready, we apply an appropriate statistical test. Which test should be applied depends on the sample sizes, the types of distributions in the samples, any underlying assumptions about the data, and so on. A test should be carefully chosen to yield practical results.
Let
T denote a function that performs the featured test thus: it takes two real matrices of the same dimensions as the input and then returns a real vector of
p-values as the output. The function computes the
p-values between the two corresponding columns of the same index in the two matrices. Therefore, the
p-values, represented as a vector $\mathbf{p}$ ($\mathbf{p} \in \mathbb{R}^t$), can be computed as
$$\mathbf{p} = T(\mathbf{F}, \mathbf{F}_N).$$
Other statistics can be computed likewise.
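As one possible instantiation of the test function T (a sketch; the choice of statistical test is deliberately left open in the procedure, and the Mann–Whitney U test, the SciPy dependency, and the synthetic data below are assumptions):

```python
import numpy as np
from scipy.stats import mannwhitneyu

def fidelity(stacked_reports, stacked_archetypes):
    """Element-wise absolute distances (the 'abs' function in the text)."""
    return np.abs(stacked_reports - stacked_archetypes)

def T(F, F_N):
    """Column-wise p-values comparing human fidelity F with random
    'fidelity' F_N, one p-value per personality dimension."""
    return np.array([mannwhitneyu(F[:, j], F_N[:, j]).pvalue
                     for j in range(F.shape[1])])

rng = np.random.default_rng(0)
r, u, t = 10, 8, 5                       # robots, users, dimensions
A = rng.uniform(1, 5, size=(r, t))       # archetype traits
# Synthetic users whose reports track the archetypes with small noise:
B = np.vstack([A + rng.normal(0, 0.3, size=(r, t)) for _ in range(u)])
N = rng.uniform(1, 5, size=(r * u, t))   # blind guesses
A_stacked = np.tile(A, (u, 1))
p = T(fidelity(B, A_stacked), fidelity(N, A_stacked))
print(p < 0.05)  # dimensions where fidelity differs from guessing
```

With reports this close to the archetypes, every dimension is flagged; with real user data, only some dimensions are expected to survive the test.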
Here, we are testing t hypotheses simultaneously. Thus, the question arises as to whether corrections for the multiple comparisons problem should be applied. In scientific research, corrections are often applied. However, in engineering robot personalities, the decision should depend on the circumstances of the specific engineering task, namely, whether it is more important to minimise the chance of Type 1 or of Type 2 errors. If it is more important to reduce cost and focus on the most prominent personality traits, it might be better to apply the corrections, so that it is less likely to identify a personality dimension by chance when in truth optimisation cannot be effectively conducted on that dimension. If it is more important to utilise the full potential of the robots, so as to make them more ‘characteristic-rich’, it might be better not to apply the corrections, so that it is less likely to disqualify a dimension by chance.
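This trade-off can be made concrete with a Bonferroni correction over the t simultaneous tests (an illustration only; the p-values are made up and `identify_dimensions` is a hypothetical helper):

```python
def identify_dimensions(p_values, alpha=0.05, correct=True):
    """Return the indices of personality dimensions whose null hypothesis
    is rejected, optionally applying a Bonferroni correction for the
    t simultaneous comparisons."""
    t = len(p_values)
    threshold = alpha / t if correct else alpha
    return [j for j, p in enumerate(p_values) if p < threshold]

p = [0.001, 0.03, 0.2, 0.004, 0.6]
print(identify_dimensions(p, correct=False))  # [0, 1, 3]
print(identify_dimensions(p, correct=True))   # [0, 3] (threshold 0.01)
```

Withholding the correction keeps dimension 1 in play (more ‘characteristic-rich’ robots, more risk of a Type 1 error); applying it drops dimension 1 (lower cost, more risk of a Type 2 error).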
3.4.3. Test 2: Robot Personalities’ Consistency Test
The consistency test follows the fidelity test. It tests Sub-Hypothesis 2 for every robot personality on each dimension. This test requires data set B and the candidates’ reports on the actor’s personality from Phase 1.
Consistency measures the strength of the consensus some observers can reach on a robot personality. Given that a benchmark is yet to be established in the field, for the time being, the consistency of humans reporting on a human can serve as the standard for that of humans reporting on a robot in the same settings. We can use Bartlett’s test as a consistency parity test. It tests homogeneity of variance while being sensitive to non-normality, meaning that passing this single test indicates both normality and homogeneity of variance of the reported personalities.
Given a number of observations
(
), as on a personality trait of an agent, and those on another
(
), consistency parity
p (
) can be computed as
The p here is indeed a p-value. However, what can be considered as ‘significant’ in the context of consistency parity requires support from more empirical results. For now, any choices seem arbitrary. The convention of can be a (very loose) significance threshold, and since we run r tests per dimension, we need to correct the p-value to by applying the Bonferroni correction to counter the multiple comparisons problem because we do not want to disqualify a dimension by chance, thereby relaxing it further. A consensus is topic-dependent, which means that a group of observers can reach a consensus separately on A or B, while a consensus on A and B together is nonsense. Consequently, we need to measure the consistency parities separately (per archetype per trait).
Let a function that performs Bartlett’s test be defined thus: it takes as input two real matrices with arbitrary numbers of rows but the same number of columns, and it returns as output a real vector of p-values, computed between corresponding columns of the same index in the two matrices. Applying this function to the matrix of the observers’ reports on the robot personalities, in which one entry is the mth observer’s report on the ith robot personality, and to the corresponding matrix of reports on humans in the same settings yields the matrix that contains all p-values of the consistency parities. To disqualify a dimension, we can consider the number of null hypotheses (consistency parities) rejected on that dimension. Considering that the consistency parity testing approach may not be strict, a stricter threshold for the number of rejected null hypotheses can be implemented here to enhance the effectiveness of the test.
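Such a column-wise function can be sketched as follows; the function name, matrix shapes, and synthetic data are our illustrative choices, not the paper’s notation:

```python
import numpy as np
from scipy.stats import bartlett

def columnwise_bartlett(a, b):
    """Apply Bartlett's test between corresponding columns of two real
    matrices (same number of columns, arbitrary rows); return the vector
    of p-values, one per column."""
    assert a.shape[1] == b.shape[1]
    return np.array([bartlett(a[:, j], b[:, j]).pvalue
                     for j in range(a.shape[1])])

rng = np.random.default_rng(1)
reports_on_robots = rng.normal(4.0, 0.8, size=(18, 10))  # observers x robots
reports_on_humans = rng.normal(4.0, 0.8, size=(18, 10))  # observers x humans

p_values = columnwise_bartlett(reports_on_robots, reports_on_humans)

# Count rejections on this dimension against a Bonferroni-corrected
# threshold; the disqualification cut-off itself is a placeholder choice.
r = len(p_values)
rejected = int(np.sum(p_values < 0.05 / r))
print(rejected, "of", r, "consistency parities rejected")
```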
3.4.4. Test 3: Robot Personalities’ Fidelity ‘Sanity’ Test
Finally, the fidelity ‘sanity’ test checks whether we can reproduce the results in the fidelity test using the observers in place of the archetypes as the reference points. It tests Sub-Hypothesis 3 on each dimension.
This test is almost the same as Test 1, except that the observers themselves replace the archetypes as the reference points. Fidelity is defined as the proximity between the observed robot personalities and their corresponding archetypes; the proximity between the observed robot personalities and the corresponding observers is therefore not fidelity, and we instead call this quantity pseudo-fidelity. This test requires data sets , B, and N.
We compute the pseudo-fidelity and generate s random reports to compute random traits as before. From N, we obtain u matrices, one per observer. Because we need the proximity values between each observer’s observations and the observer themselves, we construct u further matrices in which the reference point of the ith matrix is the ith observer’s own personality. Then, we apply the same procedure as in the fidelity test to these matrices, and the p-values of this test, represented as a vector, are computed accordingly.
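The swap of reference points can be sketched as follows; the Euclidean distance here is our stand-in for the proximity measure defined earlier in the paper, and all shapes and values are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)

u, n_robots, d = 18, 10, 5  # observers, robot personalities, dimensions

# B: each observer's reports on each robot personality (u x n_robots x d);
# N: each observer's self-reported personality (u x d). Values are synthetic.
B = rng.normal(4.0, 0.8, size=(u, n_robots, d))
N = rng.normal(4.0, 0.8, size=(u, d))

# Pseudo-fidelity: proximity between an observer's observations and the
# observer's own personality (reference point = observer, not archetype).
# Negated Euclidean distance is an assumption made for illustration.
pseudo_fidelity = -np.linalg.norm(B - N[:, None, :], axis=2)  # (u x n_robots)
print(pseudo_fidelity.shape)  # (18, 10)
```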
6. Discussion
The test procedure was put to 101 limited-scope tests, in which we simulated an engineering situation where 10 humanoid robot personalities were to be engineered for dyadic communication with small user groups of 18, 15, 12, 8, 5, and 3 users. The test results for simulated user groups of eight or more people reflect the existing body of knowledge on the potential of humanoid robots for expressing personality traits, thereby confirming the effectiveness of the proposed test procedure, within the scope of the experiment, for user groups of at least eight people. However, the variations revealed offer a glimpse into a common situation in engineering robot personalities: no one answer applies to all cases. The working hypotheses, namely that some but not all human personality dimensions can be used to engineer robot personalities for non-android robots that can imitate human behaviour, are supported at least within the scope of the experiment. Our test procedure is the first dedicated to engineering robot personalities in significant quantities for practical applications with small user groups. It is inevitably primitive, and there is much room for improvement. There are several limitations in particular on which we can focus.
The primary limitation of the test procedure is that it is currently ineffective for user groups of fewer than eight people. Compatibility with smaller user groups is the most needed improvement, considering that most household user groups probably have fewer than eight people. Another limitation is that the test procedure does not guarantee the applicability of personality dimensions based on the lexical hypothesis. For guaranteed applicability, we will need to develop dedicated personality models for robots based on lexicons for describing robots, as a recent study did for conversational agents [
9]. Dedicated models can be used together with the test procedure to identify engineering-worthy dimensions for a type of robot. Considering that robots come in different shapes and sizes for different purposes, even a dedicated model does not necessarily fully apply to a particular type of robot. Last but not least, our test procedure is formulated on the basis of our definition of robot personalities. Our definition stipulates that robots be individuals exhibiting characteristic patterns of computation and behaviour with inter- and intra-individual differences, which suits our aim of engineering robot personalities in significant quantities. However, as there is currently no consensus on how robot personality should be defined, and it is questionable whether robot personalities should be imitations of human personalities [
12], our understanding of robot personalities will keep evolving. Our current approach to creating robot personalities will need to adapt and improve accordingly.
The primary limitation of the experiment is that we only confirmed the effectiveness of the test procedure in a limited scope; we do not assert universal applicability. Future research can aim to expand the applicability of the proposed test procedure and validate it in broader contexts. In particular, although the test procedure is meant for engineering populations of robot personalities that will be observed together, such as a music band or idol group of robots or a staff of robotic errand runners in an office, the scope of the experiment did not properly reflect this aim. The experiment featured multiple instances of dyadic communication rather than users interacting with multiple robot personalities at the same time. Arguably, dyadic communication may still represent the most common form of interaction; for example, in an office where humans work with multiple errand runners, it is still more common for a human to interact with one of them at a time. Still, an important piece of future work is to cover interaction involving multiple physically present robots, which would require multiple units of the type of robot under study and abandoning video-based stimuli. In addition, the content of the interaction featured in the experiment, which was about getting to know each other, albeit important for leaving first impressions, did not necessarily reflect the main content of interaction with service robots. Future studies and implementations of the test procedure would benefit from settings that better reflect the content of interaction in realistic roles of service robots. Another major limitation of the experiment was the simulation of 101 user groups by drawing from a small pool of 18 observers. There were two possible impacts of this approach: underrepresented inter-group variation and underrepresented intra-group consistency.
A main goal of the test procedure is to take into account the inter-group differences of different user groups when engineering robot personalities. However, by drawing from the same pool of 18 observers, the possible inter-group variations were not properly reflected in the results (Table 4, Table 5, Table 6, Table 7 and Table 8). The intra-group consistency obtained by randomly combining members of the 18 online observers, who observed the robot personalities by watching short videos, might also be lower than that of a real user group interacting in the same environment with real robots.
Another limitation, present both in the proposed test procedure and in our experiment, was introduced by using an imprecise personality measure. This is also a limitation of the field, which currently lacks a personality measure precise enough for engineering tasks. The use of an imprecise measure based on Likert scales not only introduces possible incompatibility issues with statistical tests but also limits the resolution with which archetypes can be represented. In practical engineering work, we cannot let two archetypes have the same trait on any dimension; however, traits as measured by an imprecise measure can be the same in terms of their numeric trait levels. Therefore, the use of human archetypes as measured by an imprecise personality measure should be limited to identifying personality dimensions for further practical engineering work, not the actual engineering work itself, especially for engineering tasks of a certain scale, such as when we need to engineer hundreds or thousands of robot personalities, as suitable for mass-produced robots. To that end, we will need a dedicated personality model with precise, continuous scales for designing and measuring archetypes in greater numbers with better precision. But then again, it is doubtful whether such a dedicated model for robot personalities could be used by human archetypes to report their own personalities. The future of robot personalities engineering might need to go beyond imitating humans.