1. Introduction
At present, a proliferation of groupware applications (social networks, shared editors, messaging services, etc.) can be observed, enabling collaboration between users. The computer-supported collaborative work (CSCW) research field [
1] has focused on studying how technology can effectively support these collective processes. One of the main purposes of the CSCW has been the identification of methodologies that provide a systematic approach for discovering the users’ requirements, and for evaluating the degree to which these systems allow users to be aware of, modify or interact with the work of other users.
One of the most frequent characterizations of the groupware systems is based on the following two dimensions of their contexts of use [
2]: (1)
time, whether users collaborate synchronously (they collaborate at the same time) or asynchronously (they collaborate at different moments); and (2)
space, whether users are co-located or geographically distributed. Currently, most users of smart-phones use groupware applications that allow them to collaborate synchronously and/or asynchronously while being geographically distributed. Therefore, the evaluation of these groupware applications should assess the versatility of the systems to support synchronous and asynchronous interactions and whether the system is adapted to the users’ mental model so they do not have to make any effort to use the system’s features.
In order to carry out these evaluations, this article describes a methodological approach (see
Figure 1) that identifies the context of use in which the users perform the interactions (synchronous or asynchronous) and generates descriptive models of how these users orchestrate the interactions with the system. Finally, these models are processed to create a descriptive text in natural language of the main characteristics of the users’ interaction with the system. Thus, the designer can verify whether the system offers a natural interaction experience to its users in each context of use and whether users deviate substantially from the task model used to design the system. Task models are specifications widely used in the human–computer interaction (HCI) field to describe the logical activities that have to be performed in order to reach the users’ goals. Therefore, the evaluation of the system should analyze the users’ behaviors to verify that it is not hard for them to follow the sequence of actions specified by the task model. Moreover, the users of the system can have different behaviors, so the task model must also be flexible enough to enable the users to reach their goals in different ways. With this aim, we propose a methodology that processes log repositories with information about the user interactions and also models the behavior of the users. Then, the users with similar behaviors are grouped together and, for each set of similar users, a profile (in the form of a weighted automaton) is generated. Finally, these weighted automata are transformed into descriptive text that captures the characteristics of those different user profiles, which is valuable information that allows for the design of a task model according to the users’ natural behavior. With respect to our previous work [
3], we introduce two new models for the users’ behavior, one of them allowing for more flexibility (the user is no longer represented by a fixed-length vector), and we highlight the pros and cons of using any of the three proposed models. The experimental section is modified accordingly.
Note that user modeling can be understood in a more general context (from an artificial intelligence perspective), as the process through which systems obtain information about the individual characteristics of users [
4,
5]. On the other hand, in the HCI field, a lot of work has been done on ontology-based models, where three aspects are considered: user, context and device [
6,
7,
8,
9] (we refer the reader to [
10] for a very good chronological review of the evolution of user models).
2. Related Work
Suarez et al. [
11] establish a classification of interactive systems and point out that specific evaluation criteria should be applied to each one, according to their features and complexity. This classification includes groupware as a specific type of application. Among the specific features of groupware applications, it is noteworthy that they enable not only communication between users but also the collaborative building of artifacts in a shared workspace. Groupware applications support user interactions whose effects can be perceived by at least one other member of the group or another community. Therefore, the evaluation should consider specific criteria such as the effectiveness of the awareness and social support of the application. Molina et al. [
12] propose a combination of techniques (questionnaires, empirical testing, heuristic evaluation, eye tracking) to analyze the awareness support of groupware applications. Neale et al. [
13] propose some specific measurements to evaluate the degree to which an application intuitively enables social interactions.
These evaluations have been automated using tools such as Tatiana, which is independent of any groupware system and allows us to configure and to automate the analysis of interactions recorded in log files by means of several socio-cognitive methodologies. Moreover, Tatiana provides support to perform non-automated analysis, where collaborative activity is reproduced by video and the users introduce annotations or categorizations about the users’ interactions. ProM [
14] is another tool that facilitates this kind of analysis: it allows users to select different data mining algorithms in order to analyze the work-flows recorded in log files.
Thaler [
15] uses the term
usability mining for the evaluation process of information systems based on analyzing log traces. In this case, the goal is to derive a model of usage of the system that includes information about the users’ behavior (irrelevant actions, undo actions, use of the help function, etc.). The users’ actions can be analyzed using statistical techniques that generate quantitative information. These measures are known as low-level indicators [
16], as they usually do not provide an interpretation of the user’s activity. The indicators whose values provide an interpretation (cognitive, technological, etc.) of the user’s activity are known as high-level indicators.
3. Data Flow Representation
The data flow generated by the interactions of users with a computerized system can be seen as a stream of tuples of the form $(u, t, a)$, where $u$ is the user’s identifier in the system and $t$ represents the instant of time when the action $a$ took place.
More precisely, the stream $S$ is a sequence of the form
$$S = (u_1, t_1, a_1), (u_2, t_2, a_2), \ldots, (u_N, t_N, a_N),$$
where each $(u_i, t_i, a_i)$ belongs to $U \times T \times A$, $U$ is a finite set of user identifiers, $T$ is a finite set of instants in which the process is observed and $A$ is a finite set of actions. We assume that $t_i \leq t_{i+1}$, for $1 \leq i < N$. The information contained in $S$ can be segmented according to different criteria. If we are interested in the case in which $S$ is partitioned according to the user identifiers, $U = \{u^1, \ldots, u^l\}$, each segment of information is of the form:
$$S_u = (t_{i_1}, a_{i_1}), (t_{i_2}, a_{i_2}), \ldots, (t_{i_m}, a_{i_m}).$$
That is, segment $S_u$ is the subsequence of $S$ formed by those tuples in which the action is performed by the user whose identifier is $u$ (the user’s id information is the same for all tuples in $S_u$ and thus it is omitted). We call segment $S_u$ a trace (also path or trajectory) of user $u$ in the system.
For data analysis and user profiling, one might be interested in the duration of transitions between actions and not in the exact instant of time in which actions take place. In such cases, we can rewrite the trace $S_u$ as a new path of the form:
$$P_u = a_1,\, d_1,\, a_2,\, d_2,\, \ldots,\, d_{m-1},\, a_m, \qquad (1)$$
where $d_j$ is the time elapsed between action $a_j$ and action $a_{j+1}$, for $1 \leq j < m$, and thus is a continuous attribute. We can discretize these durations into a fixed number of categories, say $C = \{c_1, \ldots, c_r\}$, where each value $c_i$ represents a time interval chosen ad hoc for the system under study.
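As an illustration, the following sketch (in Python) shows how a log stream of this form can be split into per-user traces and rewritten as paths with discretized durations. The function and variable names, as well as the example tuples, are ours and purely illustrative.

```python
from collections import defaultdict
from bisect import bisect_right

def split_into_traces(stream):
    """Partition a chronologically ordered stream of (user, t, action)
    tuples into one trace per user, dropping the redundant user id."""
    traces = defaultdict(list)
    for user, t, action in stream:
        traces[user].append((t, action))
    return dict(traces)

def to_duration_path(trace, boundaries):
    """Rewrite a trace [(t1, a1), ..., (tm, am)] as the path
    a1, c1, a2, c2, ..., am, where c_j is the discretized category
    of the elapsed time between consecutive actions."""
    path = [trace[0][1]]
    for (t_prev, _), (t_next, a_next) in zip(trace, trace[1:]):
        category = bisect_right(boundaries, t_next - t_prev)  # index of the time interval
        path.extend([category, a_next])
    return path

# Example: two users, durations split into "short" (< 105 s) and "long".
stream = [("u1", 0, "propose"), ("u2", 10, "chat"), ("u1", 30, "vote"),
          ("u1", 200, "chat")]
traces = split_into_traces(stream)
print(to_duration_path(traces["u1"], boundaries=[105]))
# -> ['propose', 0, 'vote', 1, 'chat']
```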
4. User Interaction Models
The problem of modeling the user interactions can be approached from two very different perspectives. One possibility is to identify each user with its trajectory and determine which users behave similarly using a path-similarity measure. The other option is to find a generative model that behaves in the same way as the user does in the interactive system. A priori there is no restriction on the nature of the generative model, which could be a Markov model, an automaton, a Bayesian network, etc. We explore here a Markov-like structure called weighted automaton.
Definition 1. Weighted automaton (see [17]). Let $\Sigma$ be a finite alphabet of symbols and $n$ a positive natural number. A weighted automaton over $\Sigma$ with $n$ states is a tuple $\langle \alpha, (M_\sigma)_{\sigma \in \Sigma}, \beta \rangle$, where $\alpha$ and $\beta$ are vectors in $\mathbb{R}^n$ representing features of the empty prefix and of the empty suffix, respectively, and each $M_\sigma$ is an $n \times n$-matrix with real entries representing transition weights. In some situations we can omit $\alpha$ and $\beta$ if they are not relevant for the problem.
Let $A$ be the set of actions that can be performed by a user in a given interactive system. For each user $u$, we build the path $P_u$ as in (1). In this path, actions $a_i$ are in $A$ and durations $d_i$ belong to some interval $c$ in $C$, where $C$ is a finite set of time intervals as explained in
Section 3. We define the weighted automaton $W_u$ over the alphabet $C$ with $|A|$ states, in which the matrices of weights are defined as follows:
$$\big(M^u_c\big)_{a,b} = \big|\{\, i : a_i = a,\ a_{i+1} = b,\ d_i \in c \,\}\big|,$$
that is, $\big(M^u_c\big)_{a,b}$ is the number of times action $a$ precedes action $b$ in the trajectory $P_u$ of the respective user and the time elapsed between the two actions is in the time category $c$.
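A minimal sketch of this counting step, assuming the duration-discretized paths produced in the previous section; storing one $|A| \times |A|$ slice per time category is an implementation choice.

```python
import numpy as np

def count_matrices(path, actions, n_categories):
    """Build the weight matrices (M_c)_{a,b}: the number of times action a
    is immediately followed by action b with an elapsed time in category c.
    `path` is the list [a1, c1, a2, c2, ..., am] produced above."""
    idx = {a: i for i, a in enumerate(actions)}
    M = np.zeros((n_categories, len(actions), len(actions)))
    for i in range(0, len(path) - 2, 2):
        a, c, b = path[i], path[i + 1], path[i + 2]
        M[c, idx[a], idx[b]] += 1
    return M

M_u = count_matrices(['propose', 0, 'vote', 1, 'chat'],
                     actions=['propose', 'vote', 'chat'], n_categories=2)
print(M_u[0])  # short-pause transitions: propose -> vote counted once
```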
Depending on the application under study, we may be interested in the sequentiality of the user’s actions (see Model a below) or in the frequency with which each action is performed (Model b). Alternatively, the user’s behavior can be described by its path, which also leads to further categorization based on the number of previous actions we consider (Model 0, Model 1, Model 2, etc.).
Model a: we identify the user $u$ with a point of the affine space $\mathbb{R}^{|A|^2 \cdot |C|}$ defined by
$$x_u = \big( (M^u_c)_{a,b} \big)_{a, b \in A,\ c \in C}. \qquad (2)$$
Model b: the behavior of user $u$ is a point of the affine space $\mathbb{R}^{|A| \cdot |C|}$ defined by
$$y_u = \big( f_u(a, c) \big)_{a \in A,\ c \in C},$$
where $f_u(a, c)$ is the number of times user $u$ performs action $a$ with an associated duration in time category $c$.
Model p, $p \geq 0$: we identify the user $u$ with a trace of variable length defined by
$$T^p_u = s_{p+1}, s_{p+2}, \ldots, s_{m-1}, \qquad s_i = (a_{i-p}, \ldots, a_{i-1}, a_i, c_i),$$
where $c_i$ is the time interval to which $d_i$ belongs and $p + 1 \leq i \leq m - 1$. In particular, if $p = 0$, we have:
$$T^0_u = (a_1, c_1), (a_2, c_2), \ldots, (a_{m-1}, c_{m-1}).$$
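The three representations can be derived from the same count data. The sketch below is one possible reading: Model b is implemented as the frequency of each (action, following-pause category) pair, and the exact handling of the final action of a path (which has no following pause) is an implementation choice not fixed by the definitions above.

```python
import numpy as np

def model_a(M):
    """Model a: flatten the per-category transition-count matrices of a user
    into a single vector of dimension |C| * |A| * |A|."""
    return M.reshape(-1)

def model_b(path, actions, n_categories):
    """Model b (one possible reading): how often the user performs each
    action with a following pause in each time category."""
    idx = {a: i for i, a in enumerate(actions)}
    v = np.zeros((len(actions), n_categories))
    for i in range(0, len(path) - 2, 2):
        v[idx[path[i]], path[i + 1]] += 1
    return v.reshape(-1)

def model_p(path, p=0):
    """Model p: the user's trace itself; each symbol is the current
    (action, pause category) pair extended with the p preceding actions."""
    pairs = [(path[i], path[i + 1]) for i in range(0, len(path) - 2, 2)]
    return [tuple(a for a, _ in pairs[max(0, i - p):i]) + pairs[i]
            for i in range(len(pairs))]
```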
Determining whether two users have a similar behavior can then be done via distance measures or similarity measures. If users are represented as paths of variable length, their similarity can be calculated using a Monte Carlo estimation of the cross entropy between their respective traces (see Equation (7) in [18] or Equation (11) in [19]):
$$d_{MC}(T_u, T_v) = -\frac{1}{|T_u|} \sum_{s \in T_u} \log \hat{P}_v(s), \qquad \hat{P}_v(s) = \frac{1}{|T_v|} \sum_{s' \in T_v} \mathbb{1}_{\{s\}}(s'),$$
where $T_u$ and $T_v$ are the traces of the two users and $\mathbb{1}_X$ is the indicator function of the set $X$.
Moreover, to improve estimations when the available traces are too short and do not cover the entire set of available actions (along with their respective time intervals), we can use Laplace smoothing and modify the equation as follows (see Equation (8) in [18] or Equation (12) in [19]):
$$\hat{P}_v(s) = \frac{1 + \sum_{s' \in T_v} \mathbb{1}_{\{s\}}(s')}{|T_v| + |A| \cdot |C|}.$$
The more similar the traces are, the smaller the values of $d_{MC}$ we obtain, but the value zero can only be reached when both traces consist of repetitions of one and the same element. For this reason, $d_{MC}$ cannot be considered a distance function. Also, note that in general $d_{MC}(T_u, T_v)$ and $d_{MC}(T_v, T_u)$ are not equal. Nevertheless, this can be fixed by symmetrizing the measure, for instance by taking the maximum of the two values:
$$d(T_u, T_v) = \max\{\, d_{MC}(T_u, T_v),\ d_{MC}(T_v, T_u) \,\}.$$
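A compact implementation of this trace dissimilarity (with optional Laplace smoothing and the max-based symmetrization) could look as follows; `n_symbols` stands for the number of distinct (action, time category) symbols a trace may contain.

```python
import math
from collections import Counter

def mc_cross_entropy(trace_u, trace_v, n_symbols, smooth=True):
    """Monte Carlo-style cross entropy of trace_u with respect to the
    empirical (optionally Laplace-smoothed) distribution of trace_v."""
    counts_v = Counter(trace_v)

    def p_v(s):
        if smooth:
            return (counts_v[s] + 1) / (len(trace_v) + n_symbols)
        return counts_v[s] / len(trace_v)  # requires trace_v to cover s

    return -sum(math.log(p_v(s)) for s in trace_u) / len(trace_u)

def d_mc(trace_u, trace_v, n_symbols):
    """Symmetrized dissimilarity: the larger of the two directions."""
    return max(mc_cross_entropy(trace_u, trace_v, n_symbols),
               mc_cross_entropy(trace_v, trace_u, n_symbols))
```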
For users represented as vectors of fixed length, $x = (x_1, \ldots, x_n)$ and $y = (y_1, \ldots, y_n)$, the distance between them can be calculated using the Minkowski metric:
$$d_q(x, y) = \left( \sum_{i=1}^{n} |x_i - y_i|^q \right)^{1/q}.$$
The commonly used Euclidean distance is obtained for $q = 2$, the Manhattan distance (also called city block) for $q = 1$ and the Chebyshev distance for $q \to \infty$.
An alternative concept to that of the distance is the similarity function. When the angle between the two vectors is a meaningful measure, one may consider the cosine measure,
$$s_{\cos}(x, y) = \frac{\langle x, y \rangle}{\|x\| \, \|y\|},$$
where $\langle \cdot, \cdot \rangle$ is the Euclidean inner product in $\mathbb{R}^n$ and $\| \cdot \|$ is the norm induced by the inner product, or the normalized Pearson correlation:
$$r(x, y) = \frac{\langle x - \bar{x}, y - \bar{y} \rangle}{\|x - \bar{x}\| \, \|y - \bar{y}\|},$$
where $\bar{x}$ denotes the average feature value of $x$ over all dimensions.
In practice, we use $1 - s_{\cos}(x, y)$ and $1 - r(x, y)$, in order to have non-negative values that vanish for identical vectors, as in the case of the above-mentioned distances.
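The three vector-based dissimilarities can be implemented directly from the formulas above; the function names below are ours.

```python
import numpy as np

def minkowski(x, y, q):
    """Minkowski distance; q=1 Manhattan, q=2 Euclidean, q=np.inf Chebyshev."""
    diff = np.abs(np.asarray(x, float) - np.asarray(y, float))
    return diff.max() if q == np.inf else (diff ** q).sum() ** (1.0 / q)

def cosine_dissimilarity(x, y):
    """1 - cosine similarity, so that identical directions give 0."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    return 1.0 - x.dot(y) / (np.linalg.norm(x) * np.linalg.norm(y))

def pearson_dissimilarity(x, y):
    """1 - Pearson correlation (nan when one of the vectors is constant)."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    xc, yc = x - x.mean(), y - y.mean()
    return 1.0 - xc.dot(yc) / (np.linalg.norm(xc) * np.linalg.norm(yc))
```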
Note that each of the distances introduced so far corresponds to a different goal, and the choice between them should be made accordingly. In the sequel, we outline their differences using an oversimplified example in which we have only two possible actions ($a$ and $b$) and we consider only one time interval (so that the time category carries no information and can be ignored).
Example 1. Let $T_1$, $T_2$ and $T_3$ be the traces of three different users, of length 4, 40 and 4, respectively (the time interval is dropped for better readability).
| | Model a | Model b | Model 0 | | Model a | Model b | Model 0 |
| User 1 | (0, 2, 1, 0) | (2, 2) | $T_1$ | User 1 | (0, 2, 1, 0) | (2, 2) | $T_1$ |
| User 2 | (0, 20, 19, 0) | (20, 20) | $T_2$ | User 3 | (1, 1, 1, 0) | (3, 1) | $T_3$ |
| $d_1$ (Manhattan) | 36.000 | 36.000 | - | | 2.000 | 2.000 | - |
| $d_2$ (Euclidean) | 25.456 | 25.456 | - | | 1.414 | 1.414 | - |
| $d_\infty$ (Chebyshev) | 18.000 | 18.000 | - | | 1.000 | 1.000 | - |
| $1 - s_{\cos}$ | 0.044 | 0 | - | | 0.225 | 0.106 | - |
| $1 - r$ | 0.081 | nan | - | | 0.478 | nan | - |
| $d_{MC}$ | - | - | 0.693 | | - | - | 0.752 |
Based on the Minkowski metrics, user 3 is much more similar than user 2 to user 1 in both Model a and Model b, while the metrics that measure the angle between vectors show user 2 to be much more similar than user 3 to user 1 (the equality of the Model a and Model b values for the Minkowski metrics is fortuitous). The Monte Carlo-like estimation also marks user 2 as being more similar than user 3 to user 1. Therefore, if one wants to group together those users that perform actions in a similar fashion ignoring the length of their traces, then Minkowski metrics should be avoided. In contrast, these metrics are highly recommended when the length of the trace is an important aspect of user behavior for the application under study.
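As a sanity check, the values in Example 1 can be reproduced with standard SciPy distances (scipy.spatial.distance.cosine and correlation already return 1 − similarity) together with the d_mc function sketched earlier; the concrete traces below are our own choice, consistent with the Model a and Model b vectors of the example.

```python
from scipy.spatial.distance import cityblock, euclidean, chebyshev, cosine, correlation

x1, x2, x3 = [0, 2, 1, 0], [0, 20, 19, 0], [1, 1, 1, 0]   # Model a vectors
t1, t2, t3 = list("abab"), list("ab" * 20), list("aaba")   # traces consistent with them

print(cityblock(x1, x2), round(euclidean(x1, x2), 3), chebyshev(x1, x2))  # 36 25.456 18
print(round(cosine(x1, x3), 3), round(correlation(x1, x3), 3))            # 0.225 0.478
print(round(d_mc(t1, t2, 2), 3), round(d_mc(t1, t3, 2), 3))               # 0.693 0.752
```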
5. User Interaction Group Profiles
Next, we are interested in clustering the users into a certain non-specified quantity of
representative user profiles such that users in the same group (cluster) behave more similarly to each other than to users in other clusters. To this end, we use agglomerative hierarchical clustering with three linkage criteria: single, complete and average (see [
20] for more details). Given a set of
l users to be clustered and a fixed number k of desired clusters, the basic process of our hierarchical clustering is as follows.
Step 1. Start by assigning each user to its own cluster, so that if we have l users, in this initial stage we have l clusters, each containing just one user.
Step 2. Set the distances between the clusters equal to the distances between the users they contain.
Step 3. Find the closest (most similar) pair of clusters and merge them into a single cluster, so that now we have one less cluster.
Step 4. Compute distances between the new cluster and each of the old clusters.
Step 5. Repeat Steps 3 and 4 until the users are clustered into k clusters.
Depending on the type of linkage chosen, the distance between two clusters $G_1$ and $G_2$ is computed with one of the three formulas:
single linkage: $D(G_1, G_2) = \min_{x \in G_1,\, y \in G_2} d(x, y)$;
complete linkage: $D(G_1, G_2) = \max_{x \in G_1,\, y \in G_2} d(x, y)$;
average linkage: $D(G_1, G_2) = \frac{1}{|G_1|\,|G_2|} \sum_{x \in G_1} \sum_{y \in G_2} d(x, y)$;
where $d$ can be any of the distances previously defined.
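In practice, this agglomerative procedure does not need to be coded by hand; the sketch below relies on SciPy's hierarchical clustering over a precomputed distance matrix, which can be filled with any of the distances defined above.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def cluster_users(distance_matrix, k, method="complete"):
    """Agglomerative clustering of users from a precomputed, symmetric
    distance matrix; `method` is 'single', 'complete' or 'average'."""
    condensed = squareform(distance_matrix, checks=False)
    Z = linkage(condensed, method=method)
    return fcluster(Z, t=k, criterion="maxclust")   # one cluster label per user

# Example: pairwise Manhattan distances between three Model a vectors.
vectors = np.array([[0, 2, 1, 0], [0, 20, 19, 0], [1, 1, 1, 0]])
D = np.abs(vectors[:, None, :] - vectors[None, :, :]).sum(axis=2)
print(cluster_users(D, k=2))   # e.g. [1 2 1]: users 1 and 3 grouped together
```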
We denote by
k the number of representative user profiles in an interactive system. The correct choice of
k depends most of the time on the application. The optimal
k will strike a balance between maximum compression of user profiles using a single cluster, and maximum accuracy by assigning each profile to its own cluster (having one cluster per user). If an appropriate value of
k is not apparent from prior knowledge on the properties of the profile set, it must be somehow determined. In the literature, there are several proposals for making this decision effective (see [
20] for a survey). One popular proposal is to use the so called
elbow method. The intuition behind this method is that one should choose
k such that adding another cluster does not give much better modeling of the data.
However, because of the myriad of different settings (users represented as traces or as vectors, different distance metrics, different number of time intervals, different linkage criteria, etc.), we face a supplementary complication in determining the best k (see Example 2).
Example 2. Let us consider a set of users (represented as vectors) and assume we want to perform hierarchical clustering using the Manhattan distance with two linkage criteria: single and complete. The hierarchy of clusters obtained for both linkage methods happens to be the same (see Figure 2a,b), but if we were to plot the number of clusters k against the distance between the last two clusters merged when transitioning from the best k-clustering to the best (k-1)-clustering (according to the particular linkage method chosen), one would get a contradictory situation. Namely, although we have the exact same hierarchy of clusters, the change in tendency occurs at different values of k for single linkage than for complete linkage (see Figure 2c,d). Therefore, we propose a uniform way of deciding the best k. Given a set $U$ of users and a particular partition $G_1, \ldots, G_k$ of this set into k clusters, we compute the $|A|^2 \cdot |C|$-dimensional vectors $x_u$ as per Model a (see Equation (2)) and then we evaluate the within-cluster sum of squares (WCSS) error:
$$\mathrm{WCSS} = \sum_{j=1}^{k} \sum_{u \in G_j} \| x_u - \mu_j \|^2,$$
where $\mu_j$ is the centroid of the points representing the users in $G_j$.
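Computing the WCSS error for a given partition is straightforward; the helper below takes the Model a vectors and the cluster label assigned to each user (for example, the labels returned by the clustering sketch in the previous section).

```python
import numpy as np

def wcss(vectors, labels):
    """Within-cluster sum of squares: for each cluster, sum the squared
    Euclidean distances of its members to the cluster centroid."""
    vectors, labels = np.asarray(vectors, float), np.asarray(labels)
    total = 0.0
    for g in np.unique(labels):
        members = vectors[labels == g]
        total += ((members - members.mean(axis=0)) ** 2).sum()
    return total

# The elbow plot is then just this error evaluated for each k, e.g.:
# errors = [wcss(model_a_vectors, cluster_users(D, k)) for k in range(2, 31)]
```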
Note that, in principle,
Model b could also be used to represent users, but
Model a offers a finer description. Moreover, partitioning-based clustering methods allow us to find groups of similar users that optimize the WCSS error. In particular, the
k-means algorithm, a heuristic method commonly employed that converges quickly to a local optimum (see [
21]), was used in a related study [
22]. Nevertheless, there are several reasons for which hierarchical clustering is a better option in this case:
k-means is very sensitive to outlier examples (a few extreme examples can shift the cluster means considerably),
k-means works well only for round-shaped clusters of roughly equal size and density (and performs badly if the clusters have non-convex shapes),
in hierarchical clustering, we do not actually need the value of k; instead, the clusterings obtained for different values of k can be “visualized” (via the dendrogram) even for points in higher dimensional spaces (helping decide where to cut).
Once k is chosen and the groups of similar users are identified, the next step is to find a model that describes the profile of each group of users. Note that groups of users that contain less than 10% of the population are considered outliers (the threshold can be modified depending on the application). We propose to train a weighted automaton for each group of users (representing a cluster). We define the group profile of cluster G as the weighted automaton over the alphabet $C$ with $|A|$ states, in which the weight matrices are defined as follows:
$$\big(M^G_c\big)_{a,b} = \frac{\sum_{u \in G} \big(M^u_c\big)_{a,b}}{\sum_{u \in G} \sum_{c' \in C} \sum_{b' \in A} \big(M^u_{c'}\big)_{a,b'}},$$
that is, the proportion of transitions leaving action $a$ (over all members of $G$) that go to action $b$ with an elapsed time in category $c$.
Initial and final probabilities can be similarly defined, but they are not relevant in this case so we omit giving explicit formulas. According to [
23], the previous automaton maximizes the likelihood of the observations.
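A possible implementation of this aggregation step, under the assumption (consistent with the maximum-likelihood reading above) that the weight of each transition is the aggregated count normalized by the total number of transitions leaving the same action:

```python
import numpy as np

def group_profile(member_count_matrices):
    """Aggregate the per-user count matrices of a cluster and normalize each
    row over all outgoing transitions (across every time category), so that
    weights can be read as transition probabilities."""
    total = sum(member_count_matrices)                 # shape: (|C|, |A|, |A|)
    out_mass = total.sum(axis=(0, 2), keepdims=True)   # outgoing mass per source action
    with np.errstate(invalid="ignore", divide="ignore"):
        probs = np.where(out_mass > 0, total / out_mass, 0.0)
    return probs
```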
6. Case Study: Collaborative Sports Betting
Our proposal was applied to a case study in which thirty users interacted with a mobile groupware application that supports sports betting. These thirty users were randomly grouped into ten groups of three members. They were requested to collaboratively make five bets. The interactions of the users with the application were stored in a log repository.
The process of creating a bet is made up of three main steps. First, a user proposes a bet to the other members of the group. This proposal includes a result of a sport event and an amount of money. Second, the members of the group use a chat tool (see
Figure 3, right) to analyze the result and the stake of the bet. Third, the members of the group use a voting panel (see
Figure 3, center) to accept or reject the bet that has been proposed.
Figure 3, left, shows the main user interface of this application with its six panels, which allow the user to: propose a new bet to the other members of the group, see the state of the bets or proposals made previously, use a chat to discuss a proposal, create a new group of users, or see a tutorial that explains how to use the application. The
appendix of this paper includes a specification of the actions supported by the application.
We have set $|C| = 2$ and discretized all time durations into two intervals. The precise intervals are (time is expressed in seconds): $c_1 = [0, 105)$ and $c_2 = [105, \infty)$, that is, $c_1$ corresponds to short actions (that take less than one minute and three quarters), and $c_2$ corresponds to long actions (longer than one minute and three quarters). The threshold is set such that 50% of all possible durations are short and the other 50% are long.
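In other words, the 50/50 split amounts to taking the median of the collected inter-action durations; a small illustrative computation (with made-up values):

```python
import numpy as np

# `all_durations` is a hypothetical list of elapsed times (in seconds)
# collected from every trace in the log repository.
all_durations = [30, 170, 12, 400, 95, 110]            # illustrative values only
threshold = float(np.median(all_durations))            # the case study uses 105 s
boundaries = [threshold]                               # as expected by to_duration_path
```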
For each of the three linkage criteria, we generated all possible clusterings (each clustering containing
k groups, with
k in {2, …, 30}) for all the models listed in
Table 1, each of them with the corresponding distances. This accounts for a large number of distinct ways of grouping the same set of users, albeit many of them give the same output. In a real-life setting, one should be able to choose a priori which model and which measure of similarity is the most appropriate for the application under study. Nevertheless, here we perform an a posteriori analysis, which allows us to select the best combination for our case study.
First of all, we turned our attention to the linkage criteria and we observed the following:
The single-link or average-link clustering methods detect many outliers before starting to output reasonably sized groups. We report below (see
Table 2) the values of
k at which a second group of at least three users is identified.
Out of all the groups created for different values of k, only a very few of them were proper (having at least two members) in the case of average linkage, and even fewer for the single linkage criterion. The number of proper groups is presented in
Table 3.
Therefore, we concluded that the complete-link clustering method usually produces more compact clusters and more useful hierarchies than the other two clustering methods.
Once the type of linkage was chosen, we had to determine which of the five models should be used and which measure of similarity between users is the most appropriate. As mentioned in
Section 5, we plotted the WCSS error against the size of the clustering. The obtained graphics are illustrated in
Figure 4.
Inspecting the plots, we can see that for our case study, the three Minkowski-type metrics are better: there is a clear steep descent for small values of k, followed by a smooth descent after the “elbow” point. Also, in almost all models, $k = 5$ seems to be an inflection point. Between Euclidean, Manhattan and Chebyshev, we choose Manhattan (with $k = 5$) because, in comparison with the two other distances, Manhattan provides the highest number of proper groups and the lowest k at which a second group of at least three users is identified. Note that for the chosen k, Model a and Model b give the same clusterings (although this is not true for other values of k).
The weighted finite automata generated for each of the two bigger groups identified with this metric are presented in Figure 5 (the other three groups contained at most two users each, and their characteristics are described in Table 4). A description of each action is provided in the Appendix. A transition from a state $a$ to a state $b$ labeled $c/w$ has to be interpreted as follows: when the members of the group perform action $a$, with probability $w$ the next action is $b$ and the time elapsed between the two actions falls in category $c$. Note that, for better readability, transitions with weights less than 0.1 are omitted.
These automata provide information used to generate a set of rules that enables us to build a description in natural language of the main characteristics of each user profile. Thus, this methodology allows us to build expert systems that take as input log files with the users’ interactions and generate as output a textual description of the main user profiles. The following five rules are applied for this purpose (a sketch of how they can be applied automatically is given after the list):
Rule 1: If cluster G contains less than 10% of the users → The members of G can be considered as outliers.
Rule 2: If there exist actions $a, b \in A$ such that the transition from $a$ to $b$ with a long elapsed time has a significant weight → The members of G have used the system in an asynchronous context.
Rule 3: If there exist actions $a, b \in A$ such that the transition from $a$ to $b$ with a short elapsed time has a significant weight → The members of G have used the system in a synchronous context.
Rule 4: If there exists an action $a$ of panel $p$ such that no transition of the group automaton involves $a$ → During the interaction process, the members of G never used action $a$ of panel $p$.
Rule 5: If there exists a panel $p$ such that no transition of the group automaton involves any action of $p$ → During the interaction process, the members of G never used panel $p$.
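A sketch of how such a rule engine could be implemented on top of the group-profile automata is given below; the thresholds, the panel mapping and the exact wording of the generated sentences are illustrative choices, not the ones used in the actual system.

```python
def describe_profile(probs, actions, panel_of, population_share, long_cat=1):
    """Apply Rules 1-5 (as sketched above) to a group-profile automaton
    `probs` of shape (|C|, |A|, |A|) and emit a textual description."""
    sentences = []
    if population_share < 0.10:                                   # Rule 1
        sentences.append("The members of this group can be considered outliers.")
    if probs[long_cat].sum() > 0:                                  # Rule 2
        sentences.append("The group used the system in an asynchronous context.")
    if probs[1 - long_cat].sum() > 0:                              # Rule 3
        sentences.append("The group used the system in a synchronous context.")
    # Rules 4 and 5: actions/panels that never appear in any transition.
    used = probs.sum(axis=(0, 1)) + probs.sum(axis=(0, 2)) > 0
    used_panels = {panel_of[a] for i, a in enumerate(actions) if used[i]}
    unused_panels = {panel_of[a] for a in actions} - used_panels
    for panel in sorted(unused_panels):                            # Rule 5
        sentences.append(f"The group never used the {panel} panel.")
    for i, a in enumerate(actions):                                # Rule 4
        if not used[i] and panel_of[a] not in unused_panels:
            sentences.append(f"The group never used action '{a}' of the "
                             f"{panel_of[a]} panel.")
    return " ".join(sentences)

# Example call (hypothetical panel mapping):
# describe_profile(probs, actions, {"propose": "proposals", "accept": "voting"}, 0.4)
```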
Table 4 includes a textual description of the five profiles generated in this case study by means of these five rules. We can observe that the application was often used in a synchronous way. Moreover, we can observe that most users never perform some of the actions supported by the voting panel, and some users (groups 2 and 4) access this panel but do not use all of its actions. By simply checking the generated weighted automata, evaluators of the system can easily detect that the accept and reject actions are very seldom used. The users of the third group accept some proposals from other partners to generate new bets in the list of active bets. However, these users use the chat to respond to the bet proposals that they reject. This illustrates how the automaton and the textual information enable designers to understand the natural user behavior and adapt the system to it. Previous work [3] identified situations in which the users only use some of the actions supported by a specific panel. This new methodological contribution also enables us to generate profiles of users who never use some panel of the interactive system.
Note that there are some actions that are never used by any of the users (some actions of the proposals panel and of the chat panel). These are included separately in the final report and are ignored in the description of the user profiles, since they are not specific to any of the five identified profiles.
7. Conclusions
We have proposed a methodology for the automatic generation of user interaction models in interactive systems that combines unsupervised and supervised learning. We record the logs of the users in the system in the form of traces and, after a pre-processing phase in which durations are discretized into categories, we use these trajectories (either directly or by compressing them into feature vectors) as input to (various) agglomerative hierarchical clustering algorithms (unsupervised phase) in order to obtain groups of users by similarity of the empirical distributions of actions, durations and transitions. Once this process is completed, we model each group of users by means of a weighted automaton (supervised phase). This finite state machine is what we call a user profile. User profiles, as intended in this paper, constitute a low-level representation of user patterns in interactive systems. From the low-level representation we derive a profile in text form, using rules that automatically generate a text identifying features that are representative of the group of users under consideration. As future work, we plan to apply our technique to reverse software engineering, using the weighted automaton to infer a task model of the interactive system under study that may help the software engineer to improve the system’s design.