To predict the EHO of users, we (i) first collected a dataset; (ii) then trained predictive models, framing the task both as a regression and as a classification problem; and finally (iii) performed the evaluation.
3.1. Data Acquisition
For our study, we decided to collect the following data about the participants:
EHO of users.
Personality.
Genre preferences.
Film sophistication.
Furthermore, for each participant, we wanted to collect assessments of movies.
We conducted a user study to collect the required data. In total, 350 users provided 3499 assessments of 703 movies. We generated a pool of 1000 popular movies from the MovieLens 25M dataset, from which 55 movies were randomly selected and shown to each study participant.
In the first step, we measured the demographics, genre preferences, personality, EHO and film sophistication of each participant. The features extracted from these questions are summarized in Table 1 as U–F. The answers to the demographics questions are referred to as DEMQ, which includes questions about gender, education and age. To specify their gender, users could choose among male, female, other and prefer not to say.
For this study, we considered six educational categories: primary school or lower, secondary school, university bachelor's degree, university master's degree, university PhD and other professional education degrees. Users were required to input their age, which was verified to be over 18.
GPREFQ refers to the genre preference answers. Each user was asked to rate different movie genres, including action, adventure, comedy, drama, fantasy, history, romance, science fiction and thriller, on a scale from 1 to 5.
For measuring personality traits, we used the Big Five 44-Item Inventory (BFI-44) measure proposed by John and Srivastava [24] and the Ten-Item Personality Inventory (TIPI) measure proposed by Gosling et al. [28]. Due to the higher correlation of the Extraversion and Openness traits with EO and HO, we used questions from the BFI-44 for these traits; for Agreeableness, Conscientiousness and Neuroticism, we used TIPI questions. The answers to the questions are referred to as BFIQ and range from 1 to 7, with reverse-keyed items inverted on this scale as 8 − Qn. Based on these answers, we calculated the value associated with each factor in the FFM, which we refer to as BFT. Using the list of FFM questions in the same order provided by John and Srivastava [24], the Extraversion and Openness traits are calculated as follows:

Ex = [Q1 + (8 − Q6) + Q11 + Q16 + (8 − Q21) + Q26 + (8 − Q31) + Q36] / 8
Op = [Q5 + Q10 + Q15 + Q20 + Q25 + Q30 + (8 − Q35) + Q40 + (8 − Q41) + Q44] / 10

where Ex and Op stand for the personality traits Extraversion and Openness, respectively, and Qn is the n-th question in the BFI questionnaire proposed by John and Srivastava [24]. Using the list of TIPI questions in the same order provided by Gosling et al. [28], the Agreeableness, Conscientiousness and Neuroticism traits are calculated as follows:

Ag = [(8 − Q2) + Q7] / 2
Co = [Q3 + (8 − Q8)] / 2
Ne = [Q4 + (8 − Q9)] / 2

where Ag, Co and Ne stand for the personality traits Agreeableness, Conscientiousness and Neuroticism, respectively, and Qn is the n-th question in the TIPI questionnaire proposed by Gosling et al. [28].
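The trait scores above can be sketched in Python. The item positions and reverse-keyed items follow the published BFI-44 and TIPI scoring keys and are assumptions here; they should be checked against the exact questionnaire administered:

```python
# Sketch of FFM trait scoring on a 1-7 Likert scale.
# Item numbers follow the published BFI-44 and TIPI scoring keys
# (an assumption, not taken from the paper's own materials).

def reverse(q):
    """Invert a reverse-keyed item on a 1-7 scale."""
    return 8 - q

def bfi_extraversion(q):
    """q maps BFI item number -> response (1-7)."""
    items = [q[1], reverse(q[6]), q[11], q[16],
             reverse(q[21]), q[26], reverse(q[31]), q[36]]
    return sum(items) / len(items)

def tipi_neuroticism(q):
    """q maps TIPI item number -> response (1-7)."""
    return (q[4] + reverse(q[9])) / 2
```

With a neutral response of 4 on every item, both functions return 4.0, since reverse(4) = 4.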
Oliver and Raney [13] included six statements related to EO and six statements related to HO. Based on the correlation between the questions proposed by Oliver and Raney [13] and the EO/HO values, we selected three statements for measuring each. We asked users to indicate the degree to which they agree with the statements on a scale from 1 to 7. Assuming the same order of questions as in Oliver and Raney [13], Qn refers to the n-th question. EO and HO are each calculated as the mean of the three corresponding selected questions.
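A minimal sketch of computing such a scale score as the mean of selected items (assuming the scores are item averages; the index set below is a placeholder, since the specific statements selected from Oliver and Raney's questionnaire are not listed here):

```python
# Average a respondent's answers over a selected set of questionnaire items.
def scale_mean(responses, items):
    return sum(responses[i] for i in items) / len(items)

responses = {1: 6, 2: 3, 3: 5, 4: 7, 5: 2, 6: 4}  # toy 1-7 answers
eo = scale_mean(responses, [1, 3, 5])  # placeholder index set, not the real one
```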
Müllensiefen et al. [29] proposed a factor structure of a reduced self-report inventory for measuring the musical sophistication index. Based on this work, the questionnaire of the Goldsmiths Musical Sophistication Index (Gold MSI) (https://shiny.gold-msi.org/gmsiconfigurator/ (accessed on 15 September 2022)) was designed for the music domain. To measure film sophistication, we adapted the musical sophistication questionnaire to fit the movie domain. The answers to the film sophistication questionnaire are referred to as SFIQ and are on a scale from 1 to 7. From these answers, two film sophistication factors, referred to as SFI, are calculated: (i) Active Engagement and (ii) Emotions. Assuming the same order of questions as in the Gold MSI questionnaire, Qn refers to the n-th question for each sophistication factor. Active Engagement (AE) and Emotions (EM) are each calculated as the mean of their corresponding questions.
In the second step, participants were asked to select ten of the 55 movies presented to them. They were instructed to choose movies they had watched or were familiar enough with to judge in terms of their preferences and feelings while watching them.
FPREFQ is the users' rating of these movies on a scale from 1 to 5. The eudaimonic and hedonic perceptions (EHP) of users from movies were measured with a questionnaire (EHPQ) adapted from the one proposed by Oliver and Raney [13]. Based on the correlation between the questions proposed by Oliver and Raney [13] and the eudaimonic perception (EP)/hedonic perception (HP) values, we selected two statements for measuring each. We asked users to indicate the degree to which they agree with the statements on a scale from 1 to 7. Assuming the same order of questions as in Oliver and Raney [13], Qn refers to the n-th question. EP and HP are each calculated as the mean of the two corresponding questions.
3.2. Machine-Learning Workflow
The goal of the machine-learning algorithm was to predict two user characteristics, the eudaimonic orientation and the hedonic orientation, from the features collected in the user study. We approached this prediction in two ways: (i) as a regression problem and (ii) as a classification problem, where we used a median split to label users as having high or low eudaimonic orientation and high or low hedonic orientation.
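The median split can be sketched as follows (variable names are illustrative):

```python
import numpy as np

# Convert continuous orientation scores into binary high/low labels
# via a median split (values at or below the median count as low).
def median_split(scores):
    scores = np.asarray(scores, dtype=float)
    return (scores > np.median(scores)).astype(int)  # 1 = high, 0 = low

eo_scores = [2.0, 3.5, 5.0, 6.5]
labels = median_split(eo_scores)  # median 4.25 -> [0, 0, 1, 1]
```

Whether scores exactly at the median count as high or low is a design choice; here they count as low.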
As can be seen in Figure 1, the collected data are fed into the pipeline. In our dataset, the features are either numerical (integer and float types) or categorical (nominal or ordinal). There are two categorical features in the dataset: gender and education. We treated gender as a nominal categorical feature and therefore encoded it with the OneHotEncoder class from the Scikit-learn library, which encodes categorical features as a one-hot numeric array. We treated education as an ordinal categorical feature and therefore encoded it with the OrdinalEncoder class from the Scikit-learn library. All other features in the dataset are numerical. Since feature scaling is required only for machine-learning estimators that consider the distance between observations, and not for every estimator, this step is not always performed. The machine-learning algorithms that use the scaling step are listed in Table 2 and Table 3. When feature scaling was performed, we used the StandardScaler class from Scikit-learn.
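The encoding and scaling steps can be sketched with a Scikit-learn ColumnTransformer; the column names and the education ordering below are illustrative assumptions, not the study's exact schema:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, StandardScaler

# Ordered education levels (the ordering here is an illustrative assumption).
edu_order = ["primary or lower", "secondary", "bachelor", "master", "phd", "other"]

preprocess = ColumnTransformer([
    ("gender", OneHotEncoder(handle_unknown="ignore"), ["gender"]),        # nominal
    ("education", OrdinalEncoder(categories=[edu_order]), ["education"]),  # ordinal
    ("numeric", StandardScaler(), ["age", "extraversion"]),                # scaled
])

df = pd.DataFrame({
    "gender": ["female", "male", "other"],
    "education": ["master", "bachelor", "phd"],
    "age": [25, 40, 33],
    "extraversion": [4.2, 3.1, 5.0],
})
X = preprocess.fit_transform(df)  # 3 one-hot columns + 1 ordinal + 2 scaled
```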
For training the model, we used a nested k-fold cross-validation approach in which we optimized the hyperparameters of the model. Different numbers of folds were used for the outer cross-validation (where we evaluated the model) and the inner cross-validation (where we tuned the hyperparameters). Feature selection was performed both manually and automatically. Manual feature selection was performed in the initial steps by limiting the features to the list of desired features.
We also performed automated feature selection by feeding a varying number of features (the k parameter of the SelectKBest class) as a hyperparameter into the pipeline (referred to as automated feature selection in Figure 2). Automatic feature selection used the SelectKBest class from the Scikit-learn library, in which the mutual information between each individual feature and the target variable was used to decide on the final set of k features. For the parameter k of SelectKBest, we considered all integers n within a fixed range.
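The nested cross-validation with automated feature selection can be sketched as below; the fold counts, the k grid and the Ridge estimator are illustrative choices, not necessarily those used in the study:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, mutual_info_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.pipeline import Pipeline

# Synthetic stand-in for the user-study features and EHO targets.
X, y = make_regression(n_samples=60, n_features=10, noise=0.1, random_state=0)

pipe = Pipeline([
    ("select", SelectKBest(score_func=mutual_info_regression)),
    ("model", Ridge()),
])
param_grid = {
    "select__k": list(range(2, 10)),   # number of kept features, tuned as a hyperparameter
    "model__alpha": [0.1, 1.0, 10.0],  # illustrative regularization grid
}
inner = GridSearchCV(pipe, param_grid, cv=3)   # inner CV: hyperparameter tuning
scores = cross_val_score(inner, X, y, cv=5)    # outer CV: evaluation
```

Wrapping the GridSearchCV inside cross_val_score keeps tuning and evaluation on disjoint folds, which is the point of the nested scheme.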
We trained seven machine-learning algorithms to predict the EHO values of users: Lasso, Ridge, SVR, k-nearest neighbors, decision tree, random forest and gradient-boosted trees (XGBoost). We selected these models to cover a range of approaches, including linear and non-linear models. Since the choice of hyperparameters may change the results considerably, we explored a varied range of values for the different hyperparameters [30].
The list of machine-learning algorithms and the corresponding hyperparameters is provided in Table 2. In this paper, we also define two classification problems: one for predicting users' classes based on their eudaimonic orientation ((i) high eudaimonic oriented, (ii) low eudaimonic oriented) and the other based on their hedonic orientation ((i) high hedonic oriented, (ii) low hedonic oriented). The list of machine-learning algorithms and the hyperparameters for the classification problems is provided in Table 3.
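For the classification variant, the same pipeline applies with classifiers and binary median-split labels; a minimal sketch with synthetic data and an illustrative hyperparameter grid (not the grid from Table 3):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X = rng.normal(size=(80, 6))                      # stand-in for the user features
eo = X[:, 0] + rng.normal(scale=0.5, size=80)     # synthetic EO scores
y = (eo > np.median(eo)).astype(int)              # high (1) vs. low (0) orientation

grid = {"n_estimators": [50, 100], "max_depth": [3, None]}  # illustrative grid
clf = GridSearchCV(RandomForestClassifier(random_state=0), grid, cv=5)
clf.fit(X, y)
```

The best cross-validated accuracy and the winning hyperparameters are then available as clf.best_score_ and clf.best_params_.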