1. Introduction
Advances in technology and digital media advertising have enabled new approaches to measuring consumer engagement and exposure to online advertisements (ads) [1]. One area of advertising that is growing rapidly is ad-supported video streaming, which has overtaken video-on-demand streaming [2]. One advantage of these digital advertising platforms is that viewers’ responses to advertising can be measured directly and continuously [3]. This is now possible with unobtrusive, wearable sensing devices, such as smartwatches, equipped with a plethora of sensors, including accelerometers, gyroscopes, and sensors for heart rate, blood flow, electrodermal activity, and skin temperature. Video-on-demand advertising would benefit from such wearable technology, as it provides quick and accurate insights into consumer engagement and exposure to ads based on physiological signals.
Advertising evokes emotional responses and triggers cognitive processes in consumers [4,5]. This, in turn, influences individuals’ physiological responses and can provide new insights into consumer behavior. Owing to the steady development of wearable devices with physiological sensors, physiological signals may provide a way to directly and continuously measure the effects of ad exposure and better understand consumer behavior [1].
This study investigated whether physiological signals could be used as a measure of emotional engagement in video advertising. Emotional engagement is defined as “the amount of sub-conscious ‘feeling’ going on when an advertisement is being processed” [6] (p. 67). It is one of the key aspects of understanding engagement with ads, as elevated arousal has been shown to increase engagement behavior [7]. We hypothesized that physiological signals, known to be associated with emotional arousal and valence [8,9], could provide a reliable and unobtrusive measure of ad engagement.
To this end, an observational study was conducted to assess the participants’ physiological and affective responses to video ads. The focus group comprised younger adults who use streaming services extensively and are accustomed to in-video ads [3]. The ground truth for ad engagement was collected using the User Engagement Scale-Short Form (UES-SF) [10], an established psychometric instrument for measuring the affective and behavioral aspects of engagement. The physiological signals of heart rate, electrodermal activity, pupil dilation, and skin temperature were recorded as responses to video ads, along with measures of affect. Machine learning was used to model ad engagement as a classification problem, and physiological and affective responses to ads were examined as predictors of lower and higher engagement.
The presented study contributes to the existing body of knowledge by highlighting the potential of machine learning and signal fusion to improve emotional ad engagement evaluation. The main contributions of this study are as follows: (1) it demonstrates which physiological signals and their features are effective predictors of ad engagement; (2) it shows that the process of signal fusion can maintain classification performance while reducing the number of features; and (3) it shows that predictive modeling works best when signal fusion is employed.
In the following section, we first present related work. In Section 3, the materials and methods used in the experimental study are presented, with details of the experimental design, psychometric and physiological measurements, signal processing, and statistical and machine learning tools and procedures. The results of the statistical analysis, signal fusion, and classifier evaluation are presented in Section 4. The article concludes with a brief discussion of the results and possible directions for future research in Section 5.
3. Materials and Methods
An observational study was conducted in which the explanatory variables were participants’ physiological signals and affect, and the response variable was ad engagement as defined by an existing psychometric instrument of user engagement. The following steps were performed: (1) determination of the target group of users (young adults); (2) selection of the ads used in the experiment; (3) design of the experimental procedure; (4) selection of features, based on physiological sensors and affective dimensions, to be used later in machine learning; (5) selection of validation, evaluation, and performance metrics; (6) creation of machine learning models (classification of ad engagement); and (7) explanation of the models using SHAP.
3.1. Participants
Fifty young adults participated in the experiment (34 females and 16 males; age M = 21.70, SD = 2.36). Only the heart rate signal was recorded for all 50 participants; GSR and skin temperature were recorded for 47 participants, whereas reliable eye-tracking data were recorded for 33 participants. To address the challenge of missing sensor data, machine learning algorithms that can handle missing values were used to classify ad engagement.
3.2. Ad Selection
The video ad materials were carefully prepared. Twelve ads were selected from the YouTube streaming platform. To cover different levels of engagement, these materials were selected in consultation with three marketing specialists from The Nielsen Company. The content was in English and originally aired in the United States. A crowdsourcing study on Clickworker (https://www.clickworker.com, accessed on 12 October 2022) was conducted to determine user interactions with the twelve YouTube video ads. Ratings of ad engagement were collected from 360 participants (ages 18–24), who answered the question “How engaging is this ad?” on a 5-point scale (none, slightly, medium, strong, very strong).
3.3. Experimental Procedure
Equal ambient and viewing conditions were maintained throughout the study for all the participants. The experiment was conducted in a simulated living room, in a controlled environment that ensured consistent artificial lighting (no windows), constant temperature (air conditioning set to 24 °C), and quiet conditions. The room size was 4.0 m × 3.8 m × 2.5 m, and the walls were white. The lights in the room were dimmed to an illuminance of 150 lux, as suggested by [40].
The experimental design involved all the participants viewing and rating all four ads assigned to them within their designated shuffle set. Shuffling was done to control for possible carryover effects from one ad to the next, as the engagement triggered by the previous ad could influence the participant’s response to the next one. Four sets were generated, resulting in four ad orderings (Set1: Ad1, Ad2, Ad3, Ad4; Set2: Ad1, Ad3, Ad2, Ad4; Set3: Ad4, Ad2, Ad3, Ad1; Set4: Ad4, Ad3, Ad2, Ad1). The number of orderings was limited to four to keep the duration of the experiment manageable. Nevertheless, the ads were arranged so that no ad preceded the same ad more than once.
Next, the four sets were randomly and evenly assigned to the participants (balancing age and gender), with each participant rating only one set. Within each set, consecutive video ads were separated by a 2 min interval to isolate any carryover effects and give participants a break if needed.
Informed consent and demographic information were obtained from all the participants. Participants were informed of the purpose of the study and given time to familiarize themselves with the environment, wearable sensors, and procedures. Physiological signals were recorded from participants throughout the experiment.
While watching the ads, the participants sat on a sofa and looked directly at the television. The ads were played on an LCD screen with a diagonal of 49 in (approximately 125 cm). The viewing distance was set to 2 m, consistent with the reference viewing environment for evaluating HDTV images specified in SMPTE ST 2080-3:2017 [41], which states that the nominal distance of the viewer from the center of the reference screen should be 3 to 3.2 frame heights. After viewing each ad, participants were asked to rate their level of ad engagement in a survey provided on a laptop computer. The study lasted an average of 45 min.
3.4. Psychometric Measures
The ground truth for emotional engagement in ads was measured using the User Engagement Scale-Short Form (UES-SF) [10]. It is a 12-item questionnaire covering four dimensions of engagement: Focused Attention (FA), Aesthetic Appeal (AE), Perceived Usability (PU), and Reward (RW). The dimensions were rated on a 5-point scale, and the total score was calculated as the average across the selected dimensions.
The UES-SF questionnaire items were adapted to suit the context of measuring ad engagement. This is in line with the guidelines of the UES-SF, where the items from the UES-SF dimensions can be adapted to suit the task at hand [10]. In the context of ad engagement, the UES-SF measures participants’ affective (emotional) and behavioral dimensions, focusing on positive and negative affect along with aesthetic and sensory appeal, perceived usability, interest and time, and overall experience, as shown in Table 1. For example, PU is defined as “negative affect experienced as a result of the interaction and the degree of control and effort expended”, while AE is defined as “the attractiveness and visual appeal of the interface” (or, in our case, the ad) [10]. Example items for both dimensions, tailored to our case, are “PU.1: I felt frustrated while watching this Ad.” and “AE.1: This Ad was attractive.”
Along with engagement ratings, participants’ affective state (valence and arousal) and tiredness were also measured using self-reports. Valence and arousal are independent, bipolar dimensions of affect, represented on scales from pleasant to unpleasant and from active to passive, respectively. According to [42], any emotion can be described in terms of these two basic dimensions. Both affective dimensions were measured on a 7-point scale (e.g., extremely passive–extremely active). Tiredness was measured on a 5-point scale (extremely tired–not tired at all).
3.5. Physiological Measurements
Several physiological signals were recorded from the participants: eye-tracking data (pupil dilation) were recorded using the Tobii Pro Glasses 2 eye tracker, while heart rate, skin temperature, and electrodermal activity (EDA) were recorded using the Empatica E4 wristband [43] placed on the dominant wrist.
Time synchronization of the signals was ensured by having the participant make a single clap before the video started. The clap was identified in the video recorded by the Tobii Pro Glasses and in the Empatica E4 signals. Using a video and signal editor, the time stamps of all sensor devices were synchronized manually. After synchronization, a spline signal representation of order 3 was applied, missing-value analysis was performed, and the nonuniform sampling of all time-dependent physiological signals was corrected. All signals were then resampled to a common sampling frequency of 30 Hz, which matched the frame rate of the video.
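For illustration, the sketch below shows how such a resampling step could be implemented; the function and variable names (e.g., resample_to_30hz, timestamps_s) are ours, not from the study.

```python
# A minimal sketch, assuming nonuniformly sampled data with possible gaps:
# fit an order-3 spline and evaluate it on a uniform 30 Hz grid that
# matches the video frame rate.
import numpy as np
from scipy.interpolate import InterpolatedUnivariateSpline

def resample_to_30hz(timestamps_s, values, t_start, t_end):
    """Fit a cubic spline to (timestamps, values) and sample it at 30 Hz."""
    valid = ~np.isnan(values)                      # drop missing samples
    spline = InterpolatedUnivariateSpline(
        timestamps_s[valid], values[valid], k=3)  # spline of order 3
    t_uniform = np.arange(t_start, t_end, 1 / 30)  # 30 Hz grid (video rate)
    return t_uniform, spline(t_uniform)
```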
3.5.1. Heart Rate
Raw heart rate data were acquired using the Empatica photoplethysmography (PPG) sensor, an unobtrusive method commonly used to monitor heart rate parameters and oximetry [44]. The original data were sampled at 64 Hz, filtered, and resampled at 128 Hz using a spline-based algorithm to replace missing samples and enhance peak locations. Data were processed using the Python library NeuroKit2 for biomedical signal processing [45]. The PPG signal for heart rate analysis was processed using the Elgendi processing pipeline [46]. The interpolated 128 Hz sample rate brought the resolution of peak-detection-based heart rate measurement to below 1 bpm. The PPG-derived time-varying heart rate values were interpolated using monotonic cubic interpolation and exported at a 30 Hz sample rate. For segmented signal analysis, specifically the identification of inter-beat intervals required for HRV feature extraction, the library’s built-in capability to process event-separated signal segments, called epochs, was used.
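A minimal sketch of this pipeline with NeuroKit2 is shown below; the variable names (ppg_128hz, ad_onsets) and the 60 s epoch length are illustrative assumptions, not values from the study.

```python
# Clean the 128 Hz PPG signal and detect systolic peaks with the Elgendi
# method, split the recording into one epoch per ad, and derive
# time-domain HRV features from the inter-beat intervals.
import neurokit2 as nk

signals, info = nk.ppg_process(ppg_128hz, sampling_rate=128, method="elgendi")

# One epoch per ad; `ad_onsets` (sample indices of ad starts) and the
# 60 s duration are hypothetical.
epochs = nk.epochs_create(signals, events=ad_onsets, sampling_rate=128,
                          epochs_start=0, epochs_end=60)

# Time-domain HRV features (e.g., HRV_MeanNN, HRV_RMSSD) from the
# detected peak locations.
hrv_features = nk.hrv_time(info["PPG_Peaks"], sampling_rate=128)
```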
3.5.2. Electrodermal Activity and Skin Temperature
The Empatica E4 was also used to acquire electrodermal activity (EDA) and skin temperature, both at a sampling rate of 4 Hz. A spline-based algorithm was used to filter the signals and handle missing samples. The EDA data included the number of peaks detected in the skin conductance response (SCR) for each segment and the corresponding mean values of the peak amplitudes. The reported accuracy of the temperature sensor was 0.2 °C and its resolution was 0.02 °C. As only one wristband was used throughout the study, no bias adjustment was performed. The EDA and skin temperature data were resampled to 30 Hz to match the sample rate of the other signals.
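The sketch below illustrates how such EDA features could be derived with NeuroKit2; the input variable eda_30hz (the signal after resampling to 30 Hz) and the exact set of summary statistics are assumptions on our part.

```python
# Decompose the EDA signal into tonic and phasic components and keep
# summary statistics plus SCR peak counts and mean peak amplitude.
import neurokit2 as nk

signals, info = nk.eda_process(eda_30hz, sampling_rate=30)

eda_features = {
    "EDA_Tonic_mean": signals["EDA_Tonic"].mean(),
    "EDA_Tonic_std": signals["EDA_Tonic"].std(),
    "EDA_Phasic_mean": signals["EDA_Phasic"].mean(),
    "EDA_Phasic_std": signals["EDA_Phasic"].std(),
    "SCR_n_peaks": int(signals["SCR_Peaks"].sum()),
    # Mean amplitude over detected SCR peaks only.
    "SCR_amp_mean": signals.loc[signals["SCR_Peaks"] == 1, "SCR_Amplitude"].mean(),
}
```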
3.5.3. Pupillary Response
Pupil responses were measured using the Tobii Pro Glasses 2 [47], and changes in pupil diameter were extracted as raw signals. The raw data were reported by the device at nonuniform intervals. A spline-based algorithm was used to process and resample the raw pupil response data while maintaining the mean pupil diameters of the left and right eyes. A blind luminance compensation method was used to compensate for the effect of luminance in the direction of gaze on pupil diameter. The literature indicates that the pupillary light response to screen viewing is likely linear [31]. Therefore, for each participant, an ordinary least squares (OLS) model was fitted to the pupil data as a function of display brightness throughout the observation period. The obtained gain parameter was used for each participant individually to estimate the influence of brightness variance and to subtract the modeled response from the actual pupil dilation values.
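A minimal sketch of this per-participant compensation follows; the arrays display_brightness and pupil_diameter are hypothetical names for the per-sample brightness and pupil traces.

```python
# Regress pupil diameter on display brightness with OLS and subtract the
# modeled, brightness-driven component from the measured pupil trace.
import statsmodels.api as sm

X = sm.add_constant(display_brightness)               # brightness per sample
fit = sm.OLS(pupil_diameter, X, missing="drop").fit()
gain = fit.params[1]                                  # participant-specific gain
pupil_compensated = pupil_diameter - gain * display_brightness
```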
3.6. Statistical Analysis and Machine Learning
Data preprocessing, statistical analysis, and visualization were performed in Python v.3.10 [48] using the libraries pingouin v.0.5.3 [49], statsanalysis v.0.2.3 [50], statsmodels v.0.14.0 (for logistic regression) [51], and seaborn v.0.12.2 [52]. The libraries mlxtend [53] and scikit-learn [54] were used for machine learning.
Shapiro–Wilk and Levene tests were used to test the normality and homoscedasticity of the distributions. Because the data were not normally distributed, nonparametric Mann–Whitney U and Kruskal–Wallis tests were used. The significance level was set at α = 0.05, with a Bonferroni correction for multiple comparisons. The intraclass correlation coefficient (ICC) was used to test the inter-rater agreement of the UES ratings.
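For illustration, these tests can be run with pingouin as sketched below; the DataFrame df and its column names (engagement, HRV_RMSSD, UES) are hypothetical.

```python
# Normality, homoscedasticity, group-difference tests, and inter-rater
# agreement, as named above, using pingouin.
import pingouin as pg

pg.normality(df, dv="HRV_RMSSD", group="engagement")         # Shapiro-Wilk
pg.homoscedasticity(df, dv="HRV_RMSSD", group="engagement")  # Levene
pg.mwu(df.query("engagement == 'low'")["HRV_RMSSD"],
       df.query("engagement == 'high'")["HRV_RMSSD"])        # Mann-Whitney U
pg.kruskal(df, dv="HRV_RMSSD", between="ad")                 # Kruskal-Wallis

# Inter-rater agreement of the UES ratings.
pg.intraclass_corr(df, targets="ad", raters="participant", ratings="UES")
```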
Machine Learning
The raw physiological data were preprocessed and normalized; the type of normalization is reported where relevant. Feature generation was performed for each participant and each ad. The time-series analysis library pycatch22 [55] was used to generate features from the skin temperature and pupil dilation signals. pycatch22 generates 22 time-series-specific features spanning the symbolic, temporal, and frequency domains, including distribution shape, timing of extreme events, linear and nonlinear autocorrelation, incremental differences, and self-affine scaling (for an overview and feature definitions, see [55]). Note that, before generating the features, catch22 automatically z-normalizes the data. NeuroKit2 v.0.2.1 [45], a Python library for physiological signal processing, was used to process heart rate signals and generate heart rate variability (HRV) features, and to extract tonic and phasic components from the GSR signals. The means and standard deviations of the tonic and phasic GSR components were used as EDA features.
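A minimal sketch of the catch22 step is given below; temp_segment (one participant/ad skin-temperature segment) is a hypothetical variable name.

```python
# Generate the 22 canonical time-series features for one segment.
import pycatch22

res = pycatch22.catch22_all(list(temp_segment))
temp_features = dict(zip(res["names"], res["values"]))
```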
Cross-validation was used to train and evaluate the machine learning models. This method uses different subsamples (k folds) of the data to train and evaluate the models over multiple iterations, averaging the performance scores. Cross-validation reduces the risk of overfitting, which can occur with a traditional train–test split. In repeated stratified cross-validation, the folds are stratified on the target, ensuring an even distribution of target classes in each fold, and the cross-validation procedure is repeated multiple times.
All features with collinearity > 95% and/or zero variance were removed. Further analysis and selection were performed using recursive feature elimination (RFE) from scikit-learn with 5-fold cross-validation and the LightGBM (LGBM) classifier. The most important features of each signal were retained, leaving a total of 30 features from the four signals (HRV, EDA, skin temperature, and pupil size) for classification.
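The sketch below illustrates this selection step under the assumption of a feature DataFrame X and a binary engagement target y; scikit-learn's RFECV wraps recursive feature elimination in cross-validated scoring around a LightGBM estimator.

```python
# Recursive feature elimination with 5-fold cross-validation around LGBM.
from lightgbm import LGBMClassifier
from sklearn.feature_selection import RFECV

selector = RFECV(LGBMClassifier(), step=1, cv=5, scoring="roc_auc")
selector.fit(X, y)
selected = X.columns[selector.support_]  # features retained for classification
```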
The effects of signal fusion on classifier performance were analyzed using the exhaustive feature selector from mlxtend [53], selecting and evaluating all possible signal combinations with 5-fold cross-validation repeated 3 times and the LGBM classifier. The repeated k-fold method ensures objective validation of the feature selection.
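A sketch of this search is given below, assuming a feature DataFrame X, a binary target y, and hypothetical per-signal column lists (hrv_cols, eda_cols, temp_cols, pupil_cols); the feature_groups option, which keeps each signal's features together, is available in recent mlxtend versions.

```python
# Exhaustively score every combination of signal-level feature groups
# with LGBM under 5-fold cross-validation repeated 3 times.
from lightgbm import LGBMClassifier
from mlxtend.feature_selection import ExhaustiveFeatureSelector
from sklearn.model_selection import RepeatedKFold

groups = [hrv_cols, eda_cols, temp_cols, pupil_cols]  # hypothetical grouping

efs = ExhaustiveFeatureSelector(
    LGBMClassifier(),
    min_features=1,
    max_features=len(groups),          # with feature_groups, counted in groups
    scoring="roc_auc",
    cv=RepeatedKFold(n_splits=5, n_repeats=3),
    feature_groups=groups,
)
efs = efs.fit(X, y)
print(efs.best_feature_names_, efs.best_score_)
```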
The gradient boosting classifiers LGBM, HistGradientBoostingClassifier (HGBC) from scikit-learn [54], and XGBoost (XGB) were used as machine learning models. Gradient boosting is an ensemble modeling technique in which multiple base models (e.g., decision trees) are trained, and the predictions of the base models are then aggregated into a single prediction by the ensemble model [56]. A gradient boosting model “is built in a stage-wise fashion as in other boosting methods, but it generalizes the other methods by allowing optimization of an arbitrary differentiable loss function” [57]. An additional advantage of gradient boosting classifiers is that they are insensitive to scale differences in the data and can handle missing values. The latter is particularly relevant where sensor malfunctions and recording errors could significantly reduce the available training data, which is often the case in real-world settings, as it was here.
Repeated stratified k-fold cross-validation (n_splits = 10, n_repeats = 5) was used to evaluate the classifier performance, with the ROC AUC serving as a measure of model performance. The optimization configurations for all classifiers were left at the default values.
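The evaluation protocol can be sketched as follows, again assuming a feature matrix X and binary target y; each classifier is left at its default settings and scored by ROC AUC.

```python
# Repeated stratified 10-fold cross-validation (5 repeats) with ROC AUC.
from lightgbm import LGBMClassifier
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from xgboost import XGBClassifier

cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=5)
for clf in (LGBMClassifier(), HistGradientBoostingClassifier(), XGBClassifier()):
    scores = cross_val_score(clf, X, y, scoring="roc_auc", cv=cv)
    print(f"{type(clf).__name__}: {scores.mean():.3f} (+/- {scores.std():.3f})")
```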
Several steps were taken to improve the interpretability of the gradient boosting classifiers. SHAP values were used to explain the classifier output and the effect of each feature on the model [58]. Additionally, several baseline logistic regression models were trained on the raw signal data and on selected feature sets from sensor fusion to provide further insight into the impact of the features on modeling ad engagement.
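A minimal sketch of the SHAP step is shown below; model stands for a fitted gradient boosting classifier and X for the feature matrix, both hypothetical names.

```python
# Explain a tree-based classifier with SHAP values.
import shap

explainer = shap.TreeExplainer(model)   # tree-based models, e.g., LGBM/XGB
shap_values = explainer.shap_values(X)  # per-sample, per-feature attributions
shap.summary_plot(shap_values, X)       # global view of feature impact
```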
5. Discussion and Conclusions
This study investigated the potential of physiological signals and affect as reliable predictors of emotional engagement in video ads. The results presented in Section 4 and Appendix A confirm the main hypothesis that engagement in video ads can be modeled with physiological signals alone, with performance comparable to models based on feature sets combining physiological signals and affect.
The key findings of this research concern the role of specific physiological signals and how their fusion enables more effective modeling of ad engagement. The results also clearly show that signal fusion can significantly reduce the number of features while maintaining stable classification performance. This is particularly important where continuous measurement of engagement is desired and physiological signals must be evaluated in near real time.
The three features generated from skin temperature and the two affect features (valence and tiredness) formed the best-performing signal fusion set. The models trained on these features also outperformed the models trained on the larger combined feature set based on all the signals. Moreover, the results presented in Table 5 show that skin temperature is an important indicator of ad engagement in several signal fusion combinations. The sensor fusion results are further substantiated by the SHAP analysis of the gradient boosting models in Section 4.4, as well as by the logistic regression models presented in Appendix A.
To the best of our knowledge, the only comparable study using machine learning and physiological signals to investigate engagement in video ads is that of [39]. Their results are similar to ours, but with a different physiological signal (EEG), a smaller number of participants (23 vs. 50), and different machine learning models and settings. The average F1 score they reported is nearly 0.7 for the binary classification of high and low self-reported engagement. For general comparison, our average F1 score (standard deviation) for HGBC is 0.71 (0.06), with a best F1 score of 0.86, using repeated stratified cross-validation (n_splits = 10, n_repeats = 5).
In relation to the existing work, the significance of skin temperature as a predictor of ad engagement is a surprising finding. More prominent indicators of emotional engagement in the current state of the art are the physiological signals of HRV, EDA, and pupil dilation [8,9,20,21,59,60,61].
A few studies that focused specifically on skin temperature have reported how it correlates with arousal and stress. For example, Ref. [59] conducted a study on musical emotions and found that skin temperature decreased with elevated arousal and negative emotions but increased with calmness and positive emotions. A study by [60] found that stress leads to consistent temperature changes, with the temperature decreasing at distal skin locations. These findings are consistent with the results of the presented study, as shown by the SHAP analysis in Figure 7. Higher skin temperature (temp_MD_hrv_classic_pnn40) and positive valence (Mood_V_user) were positively correlated with higher emotional ad engagement. In contrast, higher arousal (Mood_A_user) was negatively correlated with higher engagement, as were higher phasic EDA (EDA_Phasic_std), higher HRV (HRV_MeanNN and HRV_RMSSD), and higher pupil dilation (pup_SB_BinaryStats_mean_longstretch1).
Another observation related to existing work concerns the role of the affective dimensions of valence and arousal as indicators of ad engagement. It was expected that arousal would be a strong indicator of emotional engagement, as several studies have reported that elevated arousal increases engagement behavior (e.g., [5,7,14]). Instead, in the presented study, valence was the better predictor, which may suggest that the participants were influenced more by whether the content of the ad was pleasant than by its arousal level. The negative correlation between elevated arousal and higher ad engagement is consistent with the findings of [5], who showed that arousal is positively related to the noticeability of an ad but can still elicit negative attitudes toward it. The authors argued that the negative relationship between arousal and attitude toward the ads can be explained “by creative executions of the ads, which do not appear to be positively perceived” [5] (p. 9).
In terms of limitations, the study’s reliance on self-reported measures of ad engagement could introduce bias, as participants’ responses may be subject to factors such as social desirability or limited introspective accuracy. The ICC test showed low inter-rater agreement on the ad engagement rankings. Moreover, the generalizability of the findings may be limited by the relatively small, non-random sample. In addition, this study did not consider the effects of different ad types or product categories on physiological responses and engagement behaviors, which could significantly influence ad engagement. With more data, it might be beneficial to analyze engagement at multiple levels rather than as a binary classification problem (lower vs. higher).
Despite these limitations, we believe that the research presented here offers new insights into the physiological measurement of ad engagement and the role of signal fusion in classification performance. Future research on this topic will aim to expand upon the findings by increasing the sample size. This research would benefit from a wider diversity of participants, which could potentially yield more varied and comprehensive results. Different physiological measures could also be explored to ascertain their effectiveness in predicting ad engagement. More importantly, the incorporation of additional variables in the classifiers, such as demographics and situational context, could contribute to improved prediction models for ad engagement.
In conclusion, the presented research contributes to the existing body of knowledge by investigating the potential of machine learning and physiological signals as predictors of ad engagement. It highlights the importance of signal fusion and demonstrates how a dimensionally reduced set of physiological signals can provide reliable classification.