1. Introduction
User experience (UX), as defined by the ISO 9241-210 standard [1], encompasses a user’s perceptions and responses that result from the use and/or anticipated use of a system, product, or service. This recently established field of research still faces challenges in defining the scope of UX in general and the application of experiential qualities in particular [2]. The participation of end users is crucial to making interactions as easy and accessible as possible, and usability testing processes help define the elements of this area and the barriers to its implementation [3]. However, usability, defined by the ISO 9241-210 standard as the extent to which a system, product, or service can be used by specified users to achieve specified goals with effectiveness, efficiency, and satisfaction in a specified context, is still mostly tested manually, with users providing feedback through evaluation methods such as cognitive walkthroughs with a think-aloud protocol and heuristic evaluation surveys [4].
Effectiveness refers to the accuracy and completeness with which goals are achieved, while efficiency considers the resources used to achieve these goals. Satisfaction, in turn, reflects the convenience and acceptability of the work system for users and other affected individuals [3,5]. These aspects can be evaluated effectively using affective computing, which involves systems that detect, interpret, process, and simulate human affects. Emotion recognition and the study of behavior resulting from users’ emotional states open up new and promising scenarios for a more immersive user experience [6]. However, the use of this technology to automate the detection of critical UI areas during usability testing has received little attention. User research is an important part of the UI/UX design process because it helps designers understand the needs, goals, and preferences of target users. Designers use several methods to gather user insights [7]: surveys for broad data collection, interviews for detailed individual insights, and usability testing to observe user interactions with a product. Combining these methods provides a comprehensive assessment of usability and user needs. Researchers often explore measuring and understanding user experience [2] and improving UX practices using data science and process automation, particularly in agile project activities [3]. Marques et al. proposed a UX evaluation technique [8] that helps identify the causes of a negative user experience. Understanding end users’ needs and problems is crucial to enhancing system usability.
The most common applications of emotion recognition technologies are in computer games, to enhance the overall experience [9]; in education, to improve teaching and the accessibility of study programs by taking the emotional state of students into account [10]; in medicine, to detect health problems earlier [11]; in advertising, to understand what kind of product a given market wants; and in customer service, to provide the best possible service quality. Medjden et al. investigated automatic adaptation of the user interface based on a multimodal emotion recognition system using an RGB-D sensor [12]. While their focus was on automation rather than UI testing, their work highlights the potential benefits of automating emotion-driven UI adjustments to enhance the user experience.
This research addresses a gap in the field by investigating UI testing based on UX principles using emotion recognition technology. Automating UI usability testing saves time and improves data collection. Emotion recognition data can enhance the identification of disliked interface features, informing interface improvements. The proposed framework prototype aims to facilitate usability testing by incorporating emotion recognition data to guide interface testing and identify areas needing improvement. This approach could benefit developers seeking user feedback to enhance the UI of web applications.
3. Results
The emotion recognition model was trained using transfer learning with the MobileNetV2 architecture on the FER-2013 dataset. The FER-2013 dataset consists of 28,709 grayscale face images of people of various ages, genders, and ethnicities, each with a resolution of 48 × 48 pixels. The dataset is labeled with seven emotion categories: angry, disgusted, scared, happy, sad, surprised, and neutral. The images were pre-processed to ensure that the face was centered and occupied as much of the frame as possible. MobileNetV2 is a deep learning model designed specifically for mobile devices that implements computer vision tasks in a simple and efficient way [29]. It is based on an inverted residual structure, where the residual connections are between the bottleneck layers, and lightweight depthwise convolutions are used in the intermediate expansion layer to filter features as a source of nonlinearity [30]. This architecture was chosen for several reasons: it requires few resources, its inverted residual blocks allow important features to be extracted with filters of different sizes, and the linear bottleneck layer contributes to flexibility and efficiency by reducing the number of parameters.
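As a rough illustration of this transfer-learning setup, the following Keras sketch loads MobileNetV2 without its classification head and attaches a new seven-class emotion output; the exact added layers, input size, and freezing strategy used in this study are not detailed above, so those choices are assumptions.

```python
# Minimal transfer-learning sketch with MobileNetV2 (illustrative layer choices).
from tensorflow.keras import layers, models
from tensorflow.keras.applications import MobileNetV2

IMG_SIZE = 224      # assumed model input size
NUM_EMOTIONS = 7    # angry, disgusted, scared, happy, sad, surprised, neutral

# Pre-trained backbone without its ImageNet classification head.
base_model = MobileNetV2(input_shape=(IMG_SIZE, IMG_SIZE, 3),
                         include_top=False, weights="imagenet")
base_model.trainable = False  # freeze the convolutional layers initially

# New input and output layers for seven-class emotion recognition.
inputs = layers.Input(shape=(IMG_SIZE, IMG_SIZE, 3))
x = base_model(inputs, training=False)
x = layers.GlobalAveragePooling2D()(x)
x = layers.Dropout(0.3)(x)
outputs = layers.Dense(NUM_EMOTIONS, activation="softmax")(x)
model = models.Model(inputs, outputs)
```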
Several emotion recognition models were tested before MobileNetV2 was chosen as the main architecture. The emotion recognition part of the current version of the Deepsight Toolkit is still under development and currently uses only a simple “smile” level system. Real-time emotion recognition from facial expressions performed poorly in tests using Deepface. With models trained in MATLAB (MathWorks, Inc., Natick, MA, USA; MATLAB R2020a), the classifier identified emotions from the test dataset perfectly, but in the real-time test it often incorrectly classified the emotion as neutral (possible reasons include limited or biased emotional expressions in the dataset, limited training data, and overfitting). Three popular pre-trained models, namely GoogleNet, AlexNet, and VGG19, also failed to correctly identify all emotions (Table 1).
In further tests, two different approaches were used to train the emotion recognition model. The first model was trained using the Keras library (Keras team, originally developed by François Chollet; Keras v. 2.15.0) combined with OpenCV image processing (OpenCV team, Open Source Computer Vision Library; opencv-python-headless==4.9.0.80). This model was built on the FER-2013 dataset, which contains images of different facial expressions. The CNN was trained using the Keras Sequential model (a linear stack of layers to which one layer is added at a time, starting from the input) with several convolutional, pooling, and fully connected layers. The second model was trained (Figure 5) using the Keras and TensorFlow libraries (Google LLC, Mountain View, CA, USA; tensorflow-cpu==2.15.0) for emotion classification from images and the eINTERFACE_Image_Dataset dataset. This model also failed to detect emotions reliably in the real-time test.
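For reference, a minimal Keras Sequential CNN of the kind described above, working directly on 48 × 48 grayscale FER-2013 images, might look as follows; the specific layer configuration is an assumption rather than the architecture actually tested.

```python
# Illustrative Sequential CNN for 48x48 grayscale FER-2013 images.
from tensorflow.keras import layers, models

cnn = models.Sequential([
    layers.Input(shape=(48, 48, 1)),
    layers.Conv2D(32, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Conv2D(64, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Flatten(),
    layers.Dense(128, activation="relu"),
    layers.Dense(7, activation="softmax"),  # seven emotion classes
])
cnn.compile(loss="categorical_crossentropy", optimizer="adam", metrics=["accuracy"])
```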
The proposed MobileNetV2 model was modified by adding extra layers and integrating new input and output layers to create a new version. Learning parameters were then set, including the categorical cross-entropy loss function, the Adam optimization algorithm, and the accuracy metric, and the model was trained with the adjusted layer weights using the training data over a selected number of epochs with the available training set size. To prepare the data for training, the samples of the training dataset were shuffled to obtain better training results. The training data were then separated into model inputs (images with facial expressions) and outputs (emotional state labels). Deep learning models typically process multiple images as a single batch and expect each image to have a specific format (width, height, and color channels); therefore, the reshape(-1, img_size, img_size, 3) method was used to give the model the structure of the training data, including the total number of images and the format of each individual image. Deep learning models also benefit from normalization of image pixel intensities: in this case, the intensities were normalized from the range [0, 255] to [0, 1] by dividing each pixel value by 255, which improves training efficiency and standardizes the input.
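Continuing the MobileNetV2 sketch above, the data preparation and training steps described here could be expressed roughly as follows; the placeholder data, batch size, and number of epochs are illustrative assumptions.

```python
# Shuffle, reshape, normalize, and train (sketch; `model` is from the previous sketch).
import numpy as np

img_size = 224

# Placeholder arrays standing in for the pre-processed FER-2013 images
# (resized to the model input and converted to 3 channels) and one-hot labels.
train_images = np.random.randint(0, 256, size=(64, img_size, img_size, 3))
train_labels = np.eye(7)[np.random.randint(0, 7, size=64)]

idx = np.random.permutation(len(train_images))            # shuffle the samples
X = train_images[idx].reshape(-1, img_size, img_size, 3)  # batch, width, height, channels
X = X.astype("float32") / 255.0                           # normalize [0, 255] -> [0, 1]
y = train_labels[idx]

model.compile(loss="categorical_crossentropy", optimizer="adam", metrics=["accuracy"])
model.fit(X, y, batch_size=32, epochs=10, validation_split=0.1)
```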
The model was trained several times with different set sizes, numbers of epochs, and amounts of training data, depending on the performance of the hardware used, in order to obtain the best result. To train the model for more epochs, it was necessary to reduce the number of images used for training and the set size, because the hardware (Lenovo Legion Y540-15IRH 15.6” (Lenovo, Beijing, China)/FHD IPS/i5-9300H/RAM 16 GB/SSD 256 GB/Nvidia GeForce GTX 1660 Ti/6 GB G6/Windows 10 Home) could not otherwise handle such a complex task. The test results are shown in Table 1.
The first verification test of all four trained models was performed using AI-generated images containing faces with different emotions. The first three models had difficulty recognizing disgust, sadness, and fear, because these emotions share certain facial features with other emotions: for example, a frightened person may have a gaping mouth, as in the case of surprise, and a disgusted face may have furrowed eyebrows that signal anger. The fourth model, with the highest accuracy, differed significantly from the previous models, successfully recognizing 7 out of 7 emotions (Figure 6).
Since the fourth model consistently recognized all seven emotions in the first test, which used AI-generated images that were not part of the training dataset, a real-time test was conducted to see how the model would perform in real-world conditions.
The second test was conducted with the following setup: (1) the test subject (the author, a white male, 31 years old) sat facing the laptop camera; (2) the laptop camera was 720p with fixed focus; and (3) the face was well lit by natural light. The Haar cascade classifier from OpenCV was used for face detection, and the face region was extracted for each detected face. This region was resized to 224 × 224 pixels to prepare the input image for the model. The model then predicted the emotion from the pre-processed image, and the result was displayed in real time. All seven emotions were recognized during the real-time test, as shown in Figure 7.
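A minimal sketch of this real-time pipeline, assuming the trained `model` from the sketches above, the default webcam, and an illustrative label order, could look like this:

```python
# Real-time test sketch: Haar-cascade face detection, 224x224 crop, per-frame prediction.
import cv2
import numpy as np

EMOTIONS = ["angry", "disgusted", "scared", "happy", "sad", "surprised", "neutral"]
face_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

cap = cv2.VideoCapture(0)  # laptop camera
while True:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    for (x, y, w, h) in face_cascade.detectMultiScale(gray, 1.3, 5):
        face = cv2.cvtColor(frame[y:y + h, x:x + w], cv2.COLOR_BGR2RGB)
        face = cv2.resize(face, (224, 224)).astype("float32") / 255.0
        probs = model.predict(face[np.newaxis], verbose=0)
        label = EMOTIONS[int(np.argmax(probs))]
        cv2.rectangle(frame, (x, y), (x + w, y + h), (0, 255, 0), 2)
        cv2.putText(frame, label, (x, y - 10),
                    cv2.FONT_HERSHEY_SIMPLEX, 0.8, (0, 255, 0), 2)
    cv2.imshow("emotion", frame)
    if cv2.waitKey(1) & 0xFF == ord("q"):
        break
cap.release()
cv2.destroyAllWindows()
```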
Since the real-time tests with the trained emotion recognition model showed promising results, tools were selected to track the user’s actions in the web application. The Hotjar tool was used, which allows the behavior of website visitors to be monitored, their browsing habits analyzed, and various information obtained about how users use websites. Hotjar offers several features, but the most important one for this research is the recording feature: the recordings section allows user sessions to be monitored and recordings viewed that show how users move through and interact with pages and where they encounter problems or difficulties. In addition, Google Analytics GA4 was chosen to collect data on the usability of the web application interface. Analyzing user behavior is essential to understanding how visitors interact with the website, what they do, and how their experience can be improved. The integration was carried out by placing the tracking scripts of both applications on the test website.
The emotion recognition program is activated by HTTP requests sent from the web application in use to the API service. To achieve this, the AWS platform was used: a Lambda function was configured from a Docker image stored in AWS ECR (Amazon Elastic Container Registry).
Base64-encoded data is decoded during frame capture, yielding a decoded byte object of the frame data. These bytes are then converted to a NumPy array of 8-bit unsigned integers, which is interpreted as an image array, since each byte corresponds to a pixel value. Finally, the frame is decoded using the OpenCV library, producing an image array that is further used for face detection and emotion recognition. Since a Lambda function is used, the program code requires a Lambda handler to handle events. This method is invoked when the Lambda function is called, and its general syntax takes two arguments: event and context.
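A hedged sketch of such a Lambda handler is given below; the request field name ("image"), the response shape, and the placeholder prediction are assumptions, since the exact payload structure is not specified here.

```python
# Sketch of the Lambda handler: decode a base64-encoded frame and return an emotion.
import base64
import json

import cv2
import numpy as np

def lambda_handler(event, context):
    body = json.loads(event.get("body") or "{}")
    frame_bytes = base64.b64decode(body["image"])        # decoded byte object
    arr = np.frombuffer(frame_bytes, dtype=np.uint8)     # 8-bit unsigned integer array
    img = cv2.imdecode(arr, cv2.IMREAD_COLOR)            # image array for OpenCV

    # ... face detection and emotion prediction on `img` would run here ...
    emotion = "neutral"  # placeholder result

    return {"statusCode": 200, "body": json.dumps({"emotion": emotion})}
```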
A working Lambda function requires a trigger, which in this case is an API gateway; the Amazon API Gateway tool is used for API development. A RESTful API was chosen because it has a simpler client–server interaction model and the camera in the web application is configured to send HTTP POST requests with the captured face data to the server.
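For illustration, the client request to such an endpoint could look like the following sketch (written in Python for consistency with the other examples, although the web application sends it from the browser); the endpoint URL and payload field are hypothetical.

```python
# Hypothetical HTTP POST of a captured frame to the API Gateway endpoint.
import base64
import requests

API_URL = "https://example.execute-api.eu-west-1.amazonaws.com/prod/emotion"  # placeholder

with open("frame.jpg", "rb") as f:  # a captured face frame
    payload = {"image": base64.b64encode(f.read()).decode("ascii")}

resp = requests.post(API_URL, json=payload, timeout=10)
print(resp.json())  # e.g. {"emotion": "happy"}
```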
Two use cases were used to test the framework:
Use case #1—logging recognized user emotions, recording and heat mapping user activity;
Use case #2—detecting specific user actions (e.g., rage click), reviewing the Google Analytics page to determine the relationship between the recognized emotion and the specific user actions on the page.
A short user session was performed during the usability test. The test session was carried out by five users with a web application developed specifically by the author of this research. The purpose of the test scenarios is to show how the parts of the system prototype, i.e., emotion recognition data, session logs, and interaction registration data, can be adapted, and how their inclusion in UI usability testing can provide additional context about the UX. During the session, the users browsed various pages of the web application, clicked buttons, and tested the UI functionality. The web application design was extended to provide more functionality and variety for usability testing. The home page included different types of blocks: headings, paragraphs, images, quotes, additional content sections, tables, and footnotes. Using the editing capabilities provided by WordPress, an options menu was created with two choices: Posts and About pages. Additionally, previously created blog posts were added to the posts page, and a new post was added to the archives and posts section on the right side of the dashboard. The main task for the users was to use all of the functionality offered by this simple blog-type web application, to which elements of poor UI design had been intentionally added, in order to collect emotion data for comparison with the session recording.
After the session, the CloudWatch logs are checked by selecting the log group /aws/lambda/EmotionRecognitionProject. By specifying the absolute date, it is possible to select the time period during which the user’s test session took place. To filter out unnecessary data, such as the start and end of the event and additional warning messages, the search window is used by entering the keyword “emotion” for events where an emotion was recognized, or “Lambda” for events where emotion recognition failed, for example, when no face was detected. In this way, the necessary emotion recognition data is obtained and tracked by timestamps to compare with the information provided by the Hotjar recording. The proposed scenario is recommended more for web applications that are already being visited by users and are undergoing further improvement than for systems that are still in the development phase.
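The same log retrieval could also be scripted with boto3 instead of the CloudWatch console, for example along the following lines; the time range values are placeholders for the actual session window.

```python
# Sketch: fetch emotion log entries from the CloudWatch log group used above.
from datetime import datetime, timezone

import boto3

logs = boto3.client("logs")

def to_millis(dt):
    """Convert a UTC datetime to epoch milliseconds, as expected by the Logs API."""
    return int(dt.replace(tzinfo=timezone.utc).timestamp() * 1000)

events = logs.filter_log_events(
    logGroupName="/aws/lambda/EmotionRecognitionProject",
    filterPattern="emotion",                           # or "Lambda" for failed detections
    startTime=to_millis(datetime(2024, 1, 1, 12, 0)),  # session start (placeholder)
    endTime=to_millis(datetime(2024, 1, 1, 12, 30)),   # session end (placeholder)
)
for e in events["events"]:
    print(e["timestamp"], e["message"].strip())
```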
When examining the data, the predicted emotion is expected to be mostly neutral, but unwanted noise may also be present. At this stage, noise filtering is performed manually by going through the different emotions and looking for clusters of negative emotions, which can then be compared with the Hotjar recording to see whether the user encountered any UI problems (Figure 8).
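A small helper of the following kind could support this manual step by flagging runs of negative emotions worth checking against the recording; the set of negative emotions and the run-length threshold are assumptions.

```python
# Flag clusters of consecutive negative emotions in a timestamped emotion log.
NEGATIVE = {"angry", "disgusted", "scared", "sad"}

def negative_clusters(log, min_run=3):
    """log: list of (timestamp, emotion) tuples ordered by time."""
    clusters, run = [], []
    for ts, emo in log:
        if emo in NEGATIVE:
            run.append(ts)
        else:
            if len(run) >= min_run:
                clusters.append((run[0], run[-1]))  # start/end timestamps of the cluster
            run = []
    if len(run) >= min_run:
        clusters.append((run[0], run[-1]))
    return clusters

# Example: periods to inspect in the Hotjar session recording.
print(negative_clusters([(1, "neutral"), (2, "angry"), (3, "angry"),
                         (4, "sad"), (5, "neutral")]))  # -> [(2, 4)]
```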
In use case #1, the recordings section of the Hotjar platform is opened and a recording is selected by matching its timestamp to the CloudWatch log data. At the corresponding timestamp, the recording shows that the user double-clicked on an image from the image gallery (Figure 9).
This observation from the recording can then be compared with a heatmap to check whether the UI has a genuine problem that tends to irritate users. The Hotjar heatmap shows the top three clicks on the page under investigation, and double-clicking on the small image is clearly common, as it is identified as one of the top clicks (Figure 10).
The heatmap analysis thus confirmed that double-clicking on small images was a common interaction pattern on the page under investigation, suggesting a usability issue related to image size and user frustration. Use case #1 helped to identify a poor UI design decision: the images are small and difficult to view, with no option to enlarge them, which could cause negative emotions for the user.
In use case #2, examination of the event log revealed two clusters of the “fearful” emotion that stood out from the overall data. Following the same procedure, the corresponding timestamp of the recording on the Hotjar platform was selected, highlighting a clear problem in the UI design. The Hotjar system itself marks this timestamp as a “rage click”. Rage clicks occur when users repeatedly click on a particular element or area of a web application within a short period of time, usually less than a minute. Multiple clicks on the right arrow button can be seen when viewing the replay and checking the action menu (Figure 11).
Use case #2 helped to identify confusing page navigation on one of the blog pages. With this information, a website developer can rethink existing user interface design decisions, such as not using arrow buttons when there are not enough related articles. It is difficult to tell which usability problems might otherwise have been overlooked, but the emotion recognition data captured the real-time user experience, which helps to improve the usability of the web application. Use cases of the other users are shown in Figure 12, Figure 13, Figure 14 and Figure 15.
Using the user interaction data collected by the Google Analytics platform, it is possible to decide whether design changes are worthwhile, for example, if the page where the problem was found is not popular. In this case, the problems were found on the title page and on the “Update” message blog page. In the Google Analytics Events tab, the value “page_view” is selected from the table, showing statistics for all page views. The report shows that, as expected, the title page is viewed the most, while the “Update” message post on the blog is not visited very often (Figure 16).
Google Analytics thus revealed that the title page is highly visited, while the “Update” message blog post is accessed infrequently. This data helps in understanding the impact of UI issues based on the popularity of the affected pages: making UI changes on less popular pages is likely to have less overall impact than changes on highly visited pages. As the problematic area is embedded only in a page that is rarely viewed, the benefit of changing the user interface in this case can be weighed against the cost of the resources required to change and improve it. The testing and data validation show that such a system can provide useful data for web application usability testing.
4. Discussion
The use of deep learning algorithms and data analysis makes the evaluation more objective, as it is based on mathematical models and actual data rather than the subjective opinions of evaluators. This shift toward data-driven assessment enhances the credibility and reliability of UX research results. Testing takes place in a real-time environment that reflects the actual user experience, which allows a more accurate assessment of how users interact with the site and helps identify real usability weaknesses. This real-time feedback can inform rapid design iterations and continuous improvement efforts. Incorporating emotion recognition into UX research also emphasizes a user-centered design approach: by understanding users’ emotions, developers can create more empathetic and intuitive web application interfaces tailored to users’ needs and preferences.
One could think about automating the system on a larger scale. Automated notifications when large numbers of negative emotions are captured would allow for quick response and resolution of issues, saving human resources and time. Automated testing processes could easily scale to large data sets and large numbers of users, enabling broader and deeper testing. These benefits show that usability testing with such a framework could be more efficient and more accurate in reflecting user needs and behavior than traditional web application testing methods.
The emotion recognition technology could include more advanced algorithms or use additional sensors to provide more accurate data about users’ emotions. Better emotion recognition results could also be achieved by ensuring that the lighting and camera quality are adequate to obtain clearer, better-quality images, and by balancing the angle and composition of the shot so that the camera view covers the face as fully as possible. To make the emotion recognition system universal and inclusive, the algorithm should be improved, or more diverse training data should be added, so that it recognizes the emotions of people of all nationalities and skin tones equally well.
To obtain more information from users—and in line with usability testing methods—using additional features of Hotjar such as surveys and interviews would allow us to obtain valuable feedback from users and better understand their needs and behavioral patterns. As the number of users grows, more time should be spent delving further into Google Analytics and taking advantage of this powerful tool, including a better understanding of on-page analytics, user flow tracking, and conversion analysis, which can provide valuable insights into user behavior and website usage. The integration with tools like Hotjar and Google Analytics enhances the framework’s practicality by leveraging existing UX research methodologies. This interoperability streamlines data collection and analysis, making the framework accessible to UX researchers and developers.
Currently, the framework focuses on analyzing facial expressions from images captured by the camera. To extend its capabilities, speech analysis could be integrated by capturing audio at the same time as images. This would involve modifying the client side to capture both image frames and audio data and to send them to the Lambda function for processing. The Lambda function would then parse and decode the audio data and apply a separate audio emotion recognition algorithm, trained on labeled data, to identify emotional states. Techniques such as CNNs or RNNs that use audio features such as pitch and intensity can be used for this purpose. As datasets and input complexity increase, optimizing image and audio processing becomes critical for scalability. Implementing techniques such as batch processing and parallelization within the Lambda function would improve the efficiency of image and audio processing and allow larger amounts of data to be handled. Distributed processing techniques, such as AWS Batch, could also be explored.
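As a rough sketch of such an audio extension, pitch and intensity features could be extracted with a library such as librosa and passed to a separate classifier; the file name, sample rate, and feature choices below are illustrative assumptions rather than part of the current framework.

```python
# Extract simple pitch and intensity features as input to an audio emotion classifier.
import librosa
import numpy as np

y, sr = librosa.load("utterance.wav", sr=16000)  # placeholder audio clip

f0 = librosa.yin(y, fmin=80, fmax=400, sr=sr)    # frame-wise pitch estimate (Hz)
rms = librosa.feature.rms(y=y)[0]                # frame-wise intensity (RMS energy)

# A compact feature vector that a CNN/RNN or a simpler classifier could consume.
features = np.array([f0.mean(), f0.std(), rms.mean(), rms.std()])
print(features)
```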
The framework’s architecture, built on cloud technologies like AWS Lambda and Docker, offers flexibility and adaptability to different types of web applications. Adapting the proposed framework for other applications, such as mobile apps, could involve developing SDKs for iOS and Android. These SDKs would handle tasks such as capturing images or video frames using device cameras and sending those frames to AWS Lambda for emotion recognition. Instead of Hotjar or Google Analytics, which are used for web applications, Firebase Analytics could track mobile-specific interactions such as app usage, touch gestures, and user engagement. It would also be recommended to create custom CloudWatch metrics to monitor specific aspects of Lambda function performance and to set up CloudWatch alarms to notify the system administrator when certain thresholds are crossed, allowing proactive management of resources.
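As an illustration of the monitoring suggested above, a CloudWatch alarm on Lambda errors could be created with boto3 roughly as follows; the alarm name, threshold, and SNS topic ARN are placeholders.

```python
# Sketch: alarm when the emotion recognition Lambda reports too many errors.
import boto3

cloudwatch = boto3.client("cloudwatch")

cloudwatch.put_metric_alarm(
    AlarmName="EmotionRecognitionLambdaErrors",
    Namespace="AWS/Lambda",
    MetricName="Errors",
    Dimensions=[{"Name": "FunctionName", "Value": "EmotionRecognitionProject"}],
    Statistic="Sum",
    Period=300,                      # 5-minute windows
    EvaluationPeriods=1,
    Threshold=5,                     # notify after 5 errors in a window (placeholder)
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:eu-west-1:123456789012:ops-alerts"],  # placeholder ARN
)
```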
5. Conclusions
This study investigated the development of a usability evaluation framework for web applications based on user emotion recognition in real-world scenarios. Using the selected technologies, a prototype of the proposed framework was created, which helps to evaluate the usability of a web application based on emotion recognition, session recording, and interaction registration. For future development, it is recommended to explore more robust emotion recognition algorithms capable of accounting for diverse demographic factors, facial expressions, and cultural nuances. Additionally, incorporating multimodal interaction analysis, combining emotion recognition with gesture recognition, speech analysis, and physiological signals, could provide a more holistic understanding of user engagement and satisfaction. Exploring applications in areas such as mobile interfaces, virtual reality, and interactive systems could extend the framework’s usability evaluation capabilities to cross-domain assessments.
During the testing session, the proposed framework prototype demonstrated its ability to capture and analyze user emotions effectively using real-time data monitoring and analysis. Through CloudWatch log analysis and comparison of the obtained data with Hotjar recordings, it was possible to identify problematic areas of a web application where users experience negative emotions. Future work could explore integration with emerging technologies such as natural language processing (NLP) to expand the framework’s capabilities in novel HCI contexts, as well as AI-driven insights and automation that provide actionable recommendations for optimizing the user experience based on emotion analysis. Additionally, developing customizable modules within the framework could accommodate specific user demographics, cultural contexts, and application domains.
Based on the successful testing, data validation, and complementary strengths of emotion recognition, session recording, and user interaction analysis, the proposed framework prototype holds great promise for evaluating web application usability based on user emotions in real-world scenarios.
Despite its strengths, this framework has certain limitations, such as reliance on specific hardware for emotion recognition. Future research could focus on enhancing the framework’s adaptability to different platforms. Nonetheless, this approach empowers developers to make data-driven decisions for optimizing user experience.