Demographics and Personality Discovery on Social Media: A Machine Learning Approach
Abstract
1. Introduction
- We propose methods for extracting demographic and personality attributes from Reddit users using author flairs.
- We also propose multiple feature sets and evaluate them with several machine learning algorithms to find the best-performing combinations.
- To validate our experimental results, we use processed author flairs as ground truth for training and testing.
2. Materials and Methods
2.1. Experimental Data
2.2. Framework Overview
2.3. Data Preprocessing
2.4. Private Attribute Extraction
2.4.1. Gender Identity
2.4.2. Age Group
2.4.3. Residential Area
2.4.4. Education Level
2.4.5. Political Affiliation
2.4.6. Religious Belief
2.4.7. Personality Type
2.5. Feature Extraction
2.5.1. Human-Designed Features
2.5.2. Bag-of-Words (BoW) Features
2.5.3. Community Activity (CA) Features
Algorithm 1. The proposed feature extraction and selection algorithm for community activity features.

    function ExtractActivityFeatures(PreprocessedComments, UseWeighted, SelectKBest)
        ActivityFeatures[][] = A two-dimensional array
        for each Comment in PreprocessedComments do
            User = Comment’s author
            Community = Community of comment’s post
            ActivityFeatures[User][Community] += 1
        end for
        if UseWeighted then
            ActivityFeatures = CalculateWeight(ActivityFeatures)
        end if
        if SelectKBest then
            ActivityFeatures = FTest(ActivityFeatures, 100)
        end if
        return ActivityFeatures
    end function
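The counting and weighting steps of Algorithm 1 can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the `(author, community)` input format and the function names are assumptions, and the tf-idf-style weighting in `weight_features` is a hypothetical stand-in because the weighting function is not specified above. The F-test selection step is left out of this sketch.

```python
import math
from collections import Counter, defaultdict


def extract_activity_features(comments, use_weighted=False):
    """Count how many comments each user posted in each community.

    `comments` is an iterable of (author, community) pairs (assumed format).
    """
    features = defaultdict(Counter)
    for author, community in comments:
        features[author][community] += 1
    if use_weighted:
        return weight_features(features)
    return {user: dict(counts) for user, counts in features.items()}


def weight_features(features):
    """Hypothetical weighting: relative frequency times a smoothed
    inverse user frequency (how few users are active in the community)."""
    n_users = len(features)
    user_freq = Counter()
    for counts in features.values():
        user_freq.update(counts.keys())
    weighted = {}
    for user, counts in features.items():
        total = sum(counts.values())
        weighted[user] = {
            community: (n / total)
            * (math.log((1 + n_users) / (1 + user_freq[community])) + 1)
            for community, n in counts.items()
        }
    return weighted


comments = [
    ("alice", "keto"), ("alice", "keto"), ("alice", "AskWomen"), ("bob", "keto"),
]
counts = extract_activity_features(comments)
weights = extract_activity_features(comments, use_weighted=True)
```

Storing the features as nested dictionaries keeps the user-by-community matrix sparse, which matters when most users are active in only a handful of the tens of thousands of communities.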
2.5.4. Hybrid Features (HF)
Algorithm 2. The proposed feature extraction and selection algorithm for hybrid features.

    function ExtractHybridFeatures(LIWCFeatures, NgramFeatures, ActivityFeatures, UseLIWC, UseWeighted, SelectKBest)
        HybridFeatures[][] = A two-dimensional array
        Users = Users in NgramFeatures and ActivityFeatures
        for each User in Users do
            for each Feature in NgramFeatures do
                FeatureValue = NgramFeatures[User][Feature]
                HybridFeatures[User][Feature] = FeatureValue
            end for
            for each Feature in ActivityFeatures do
                FeatureValue = ActivityFeatures[User][Feature]
                HybridFeatures[User][Feature] = FeatureValue
            end for
        end for
        if UseLIWC then
            for each User in Users do
                for each Feature in LIWCFeatures do
                    FeatureValue = LIWCFeatures[User][Feature]
                    HybridFeatures[User][Feature] = FeatureValue
                end for
            end for
        end if
        if UseWeighted then
            HybridFeatures = Tfidf(HybridFeatures)
        end if
        if SelectKBest then
            HybridFeatures = FTest(HybridFeatures, 10000)
        end if
        return HybridFeatures
    end function
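A minimal Python sketch of the concatenation and k-best selection steps in Algorithm 2 is given below. Several details are assumptions rather than the authors' method: "Users in NgramFeatures and ActivityFeatures" is read as the intersection of the two user sets, feature names are prefixed by their source so that a community and an n-gram with the same name do not overwrite each other, and `FTest` is taken to be a one-way ANOVA F statistic. The tf-idf weighting step is omitted.

```python
from collections import defaultdict


def build_hybrid(ngram_features, activity_features, liwc_features=None):
    """Concatenate per-user feature dicts, prefixing names by source."""
    sources = [("ngram", ngram_features), ("ca", activity_features)]
    if liwc_features is not None:
        sources.append(("liwc", liwc_features))
    users = set(ngram_features) & set(activity_features)  # assumed: intersection
    return {
        user: {
            f"{prefix}:{name}": value
            for prefix, table in sources
            for name, value in table.get(user, {}).items()
        }
        for user in users
    }


def f_statistic(values, labels):
    """One-way ANOVA F statistic of a single feature across classes.

    Assumes at least two classes and more samples than classes.
    """
    groups = defaultdict(list)
    for value, label in zip(values, labels):
        groups[label].append(value)
    n, k = len(values), len(groups)
    grand_mean = sum(values) / n
    means = {label: sum(g) / len(g) for label, g in groups.items()}
    between = sum(len(g) * (means[lab] - grand_mean) ** 2 for lab, g in groups.items())
    within = sum((v - means[lab]) ** 2 for lab, g in groups.items() for v in g)
    if within == 0:
        return float("inf") if between > 0 else 0.0
    return (between / (k - 1)) / (within / (n - k))


def select_k_best(features, labels, k):
    """Keep the k features with the highest F statistic.

    `labels` maps each user to a class; missing feature values count as 0.
    """
    users = sorted(features)
    names = sorted({name for row in features.values() for name in row})
    y = [labels[u] for u in users]
    scores = {
        name: f_statistic([features[u].get(name, 0.0) for u in users], y)
        for name in names
    }
    keep = set(sorted(names, key=lambda name: scores[name], reverse=True)[:k])
    return {u: {n: v for n, v in features[u].items() if n in keep} for u in users}


# Toy data (hypothetical): "gym" separates the classes, "keto" does not.
ngrams = {"u1": {"gym": 0.9}, "u2": {"gym": 0.8}, "u3": {"gym": 0.1}, "u4": {"gym": 0.2}}
activity = {"u1": {"keto": 1}, "u2": {"keto": 5}, "u3": {"keto": 1},
            "u4": {"keto": 5}, "u5": {"news": 2}}
labels = {"u1": "A", "u2": "A", "u3": "B", "u4": "B"}

hybrid = build_hybrid(ngrams, activity)
best = select_k_best(hybrid, labels, k=1)
```

In the toy data, user `u5` is dropped because it has no n-gram features, and only `ngram:gym` survives selection because its class means differ while those of `ca:keto` do not.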
2.6. Feature Selection
2.7. Classification Algorithms
- Majority class classifier (MCC) always predicts the most frequent class in the dataset. It is commonly used as a baseline that machine learning models are expected to outperform.
- Multinomial naïve Bayes (NB) is a popular probabilistic classifier. We used the classic multinomial variant for text classification, with Laplace smoothing.
- Support vector machine (SVM) [14] finds a separating hyperplane between two sets of data points. We used a linear SVM with L2 penalization and the squared hinge loss, and the one-vs-rest strategy for multi-class datasets.
- Random forest (RF) [15] is a majority-voting classifier that consists of multiple decision trees, each trained on a different dataset. We built a random forest of 100 decision trees, with the maximum number of features set to the square root of the original number of features.
- Multi-layer perceptron (MLP) is a fully connected artificial neural network. We used two hidden layers, each with 64 units and rectified linear unit (ReLU) activation. We held out 10% of the training data as a validation set for early stopping.
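To illustrate why the majority class classifier is a weak baseline, the sketch below scores it with macro-averaged F1 using class counts from the dataset statistics. Macro averaging is an assumption here, since this section does not state the averaging scheme, although the values it yields appear consistent with the MCC row of the results table.

```python
def macro_f1_of_majority_classifier(class_counts):
    """Macro-averaged F1 when every instance is assigned to the largest class."""
    total = sum(class_counts.values())
    majority = max(class_counts.values())
    precision = majority / total  # every prediction is the majority class
    recall = 1.0                  # every majority-class instance is recovered
    f1_majority = 2 * precision * recall / (precision + recall)
    # The remaining classes are never predicted, so each contributes an F1 of 0.
    return f1_majority / len(class_counts)


gender = {"Male": 8797, "Female": 8792}
age = {
    "Young Adult": 1791,
    "Teenager": 1790,
    "Younger Middle-Aged": 501,
    "Older Middle-Aged": 54,
}

gender_baseline = round(macro_f1_of_majority_classifier(gender), 3)  # 0.333
age_baseline = round(macro_f1_of_majority_classifier(age), 3)        # 0.151
```

Because every class except the largest scores zero, the baseline drops quickly as the number of classes grows, which is why it is lowest for the 16-class personality type dataset.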
2.8. Imbalance Problem
3. Results
3.1. Classification Performance
3.2. Training Time
3.3. Robustness
3.4. Imbalance Problem
4. Discussion
4.1. Demographic Attributes
4.2. Personality Types
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
1. Smedt, T.D.; Pauw, G.D.; Ostaeyen, P.V. Automatic Detection of Online Jihadist Hate Speech. arXiv 2018, arXiv:1803.04596.
2. Zhao, W.X.; Li, S.; He, Y.; Wang, L.; Wen, J.-R.; Li, X. Exploring Demographic Information in Social Media for Product Recommendation. Knowl. Inf. Syst. 2016, 49, 61–89.
3. Neal, A.; Yeo, G.; Koy, A.; Xiao, T. Predicting the Form and Direction of Work Role Performance from the Big 5 Model of Personality Traits. J. Organ. Behav. 2012, 33, 175–192.
4. Matz, S.C.; Kosinski, M.; Nave, G.; Stillwell, D.J. Psychological Targeting as an Effective Approach to Digital Mass Persuasion. Proc. Natl. Acad. Sci. USA 2017, 114, 12714–12719.
5. Myers, I.B. Gifts Differing: Understanding Personality Type; CPP Books: Palo Alto, CA, USA, 1993; ISBN 978-0-89106-064-2.
6. Barbuto, J.E. A Critique of the Myers-Briggs Type Indicator and Its Operationalization of Carl Jung’s Psychological Types. Psychol. Rep. 1997, 80, 611–625.
7. McCrae, R.R.; Costa, P.T. Reinterpreting the Myers-Briggs Type Indicator from the Perspective of the Five-Factor Model of Personality. J. Pers. 1989, 57, 17–40.
8. Furnham, A. The Big Five versus the Big Four: The Relationship between the Myers-Briggs Type Indicator (MBTI) and NEO-PI Five Factor Model of Personality. Personal. Individ. Differ. 1996, 21, 303–307.
9. Kosinski, M.; Stillwell, D.; Graepel, T. Private Traits and Attributes Are Predictable from Digital Records of Human Behavior. Proc. Natl. Acad. Sci. USA 2013, 110, 5802–5805.
10. Aletras, N.; Chamberlain, B.P. Predicting Twitter User Socioeconomic Attributes with Network and Language Information. In Proceedings of the 29th on Hypertext and Social Media, Baltimore, MD, USA, 9–12 July 2018; ACM: New York, NY, USA, 2018; pp. 20–24.
11. Ferwerda, B.; Tkalcic, M. Predicting Users’ Personality from Instagram Pictures: Using Visual and/or Content Features? In Proceedings of the 26th Conference on User Modeling, Adaptation and Personalization, Singapore, 8–11 July 2018; ACM: New York, NY, USA, 2018; pp. 157–161.
12. Gjurković, M.; Šnajder, J. Reddit: A Gold Mine for Personality Prediction. In Proceedings of the Second Workshop on Computational Modeling of People’s Opinions, Personality, and Emotions in Social Media, New Orleans, LA, USA, 6 June 2018; Association for Computational Linguistics: Stroudsburg, PA, USA, 2018; pp. 87–97.
13. Tausczik, Y.R.; Pennebaker, J.W. The Psychological Meaning of Words: LIWC and Computerized Text Analysis Methods. J. Lang. Soc. Psychol. 2010, 29, 24–54.
14. Cortes, C.; Vapnik, V. Support-Vector Networks. Mach. Learn. 1995, 20, 273–297.
15. Ho, T.K. Random Decision Forests. In Proceedings of the 3rd International Conference on Document Analysis and Recognition, Montreal, QC, Canada, 14–16 August 1995; Volume 1, pp. 278–282.
16. Chawla, N.V.; Bowyer, K.W.; Hall, L.O.; Kegelmeyer, W.P. SMOTE: Synthetic Minority Over-Sampling Technique. J. Artif. Intell. Res. 2002, 16, 321–357.
Replacement Token | Description |
---|---|
xxurl | URL |
xxuser | Author name |
xxsub | Community name |
xxrep | Repeated word |
xxelon | Elongated word |
xxd | One-digit number |
xxdd | Two-digit number |
xxddd | Three-digit number |
xxdddd | Four-digit number |
xxddddd | Five-digit number |
xxeos | The end of a comment |
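The replacement tokens above could be produced with a few regular expressions, as in the following sketch. The patterns are illustrative assumptions rather than the authors' rules: URLs are taken to start with `http://` or `https://`, author and community mentions to use Reddit's `u/` and `r/` prefixes, and an elongated word to contain a letter repeated three or more times. The `xxrep` token is omitted because its exact matching rule is not described here, and `xxeos` is assumed to separate the comments of one author.

```python
import re
from collections import defaultdict

URL = re.compile(r"https?://\S+")
USER = re.compile(r"(?<!\w)/?u/[\w-]+")
SUB = re.compile(r"(?<!\w)/?r/\w+")
ELONGATED = re.compile(r"\b[a-z]*([a-z])\1{2,}[a-z]*\b")
NUMBER = re.compile(r"\b\d{1,5}\b")


def replace_tokens(comment):
    """Lowercase a comment and substitute the replacement tokens."""
    text = comment.lower()
    text = URL.sub("xxurl", text)
    text = USER.sub("xxuser", text)
    text = SUB.sub("xxsub", text)
    text = ELONGATED.sub("xxelon", text)
    # Numbers go last so that tokens such as "xxddd" are not re-matched
    # as elongated words.
    text = NUMBER.sub(lambda m: "xx" + "d" * len(m.group()), text)
    return " ".join(text.split())


def aggregate(comments):
    """Join each author's comments, marking comment ends with xxeos.

    `comments` is an iterable of (author, body) pairs (assumed format).
    """
    by_author = defaultdict(list)
    for author, body in comments:
        by_author[author].append(replace_tokens(body))
    return {author: " xxeos ".join(bodies) for author, bodies in by_author.items()}


example = replace_tokens("Sooooo good, see https://example.com or r/keto at 12 30")
merged = aggregate([("a", "hi u/bob"), ("a", "see 7")])
```

Replacing numbers by digit count keeps information such as "a two-digit number follows a gender symbol" available to the classifier without creating one vocabulary entry per number.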
Author Name | Community | Comment Body |
---|---|---|
###### | AMD_Stock | this is all i could find, fwiw xxurl it does line up with the expected end... |
###### | TributeMe | this is my favorite yet. if you pm me more, i will tribute it. xxurl |
###### | news | the trolls from t d constantly brigade astroturf xxsub in a bid to control t... |
###### | squirting | source xxurl gif starts at xxd xxdd xxdd |
###### | AskReddit | somebody already did your job. something for you to read xxurl |
###### | Windows10 | xxurl xxelon xxurl basically, instead of white, we get that subdued color... |
###### | tifu | thank you for submitting to xxsub, xxuser. your submission, tifu by oblit... |
###### | sydney | i figure this clears xxelon xxuser xxelon xxuser, xxuser xxelon xxuser… |
###### | CxTV | this thread was crossposted from xxurl made by xxuser. to mute xpost… |
###### | woooosh | far from heaven, xxsub! xxsub is xxelon better |
Author Name | Aggregated Comments |
---|---|
###### | sure, but this is not xxsub, it is xxsub. posts need to demonstrate they... |
###### | xxurl xxeos xxurl xxeos philosophy mainly, or fiction that tends to be phil... |
###### | it is happened before, it will happen again not mine, xxuser s xxeos i wan... |
###### | from xxurl appropriate swim attire required, cotton shorts or shirts, spor... |
###### | you could try posting to xxsub xxeos i do not really have a time limit on da... |
###### | now i am just confused xxurl sh e xxddd db xxdd xxeos angry birds from outer... |
###### | you are cordially invited to xxsub xxeos is her cousin stan from south park... |
###### | here you go man. sorry, fell asleep. xxurl gclid cj xxd kcqjw xxdd bbrd a... |
###### | does not look like the xxsub has a chat. maybe try pming the mods and sugge... |
###### | hkj reviews of aa nimh chargers xxurl this sofirn looks ok not many chargers... |
Author Name | Community Name | Flair Class | Flair Text |
---|---|---|---|
###### | AskALiberal | - | Centrist Democrat |
###### | atheism | no-knight | Atheist |
###### | ConservativesOnly | - | McCarthy did nothing wrong |
###### | Conservative | Conservative | Conservative |
###### | sexover30 | male | ♂ 50 |
###### | Judaism | Orange | converting Conservative |
###### | Christianity | chirho | Christian (Chi Rho) |
###### | Christianity | coeusa | Episcopalian (Anglican) |
###### | datingoverthirty | male | ♂ Forty Minus One |
###### | datingoverthirty | female | ♀ 32 |
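As a rough illustration of how flair text such as the rows above might be turned into labels, the sketch below reads a gender symbol and a numeric age from a flair string. The function name and the parsing rules are hypothetical; the mapping from ages to age groups is not shown because the group boundaries are not given in this section.

```python
import re

GENDER_SYMBOLS = {"♂": "Male", "♀": "Female"}
AGE = re.compile(r"\b(\d{1,2})\b")


def parse_flair(flair_text):
    """Return (gender, age) parsed from a flair string; None where not found."""
    gender = next(
        (label for symbol, label in GENDER_SYMBOLS.items() if symbol in flair_text),
        None,
    )
    match = AGE.search(flair_text)
    age = int(match.group(1)) if match else None
    return gender, age


parsed = [parse_flair(text) for text in ("♂ 50", "♀ 32", "♂ Forty Minus One")]
```

Free-text flairs such as "Forty Minus One" yield no age under these rules, which shows why flairs need processing and filtering before they can serve as ground truth.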
Attribute | Communities |
---|---|
Gender Identity | 40something, AskMen, AskMenOver30, AskWomen, AskWomenOver30, DatingAfterThirty, DirtySnapchat, GWABackstage, LGBTeens, OkCupid, RelationshipsOver35, Tinder, amiugly, asktransgender, askwomenadvice, assholegonewild, childfree, datingoverthirty, keto, loseit, sexover30, xxketo |
Age Group | 40something, DatingAfterThirty, LGBTeens, OkCupid, RelationshipsOver35, Tinder, childfree, datingoverthirty, keto, loseit, sexover30, teenager, xxketo |
Residential Area | AskAnAmerican, Africa, Arabs, Argentina, Brazil, Cambodia, Chile, China, Colombia, Europe, India, Indonesia, Japan, Korea, Laos, Malaysia, Thailand |
Education Level | GradSchool, college, teenager |
Political Affiliation | AskALiberal, CanadaPolitics, Conservative, ConservativesOnly, Republican, True_AskAConservative, askaconservative, liberalgunowners, ukpolitics |
Religious Belief | AskAChristian, AskReligion, Christianity, DebateAChristian, DebateAnAtheist, DebateReligion, Judaism, OpenChristian, TrueChristian, atheism, excatholic, exchristian, survivor |
Personality Type | MBTI, ENFJ, ENFP, ENTJ, ENTP, ESFJ, ESFP, ESTJ, ESTP, INFJ, INFP, INTJ, INTP, ISFJ, ISFP, ISTJ, ISTP |
Group | Feature Set | #Features | Description |
---|---|---|---|
Baseline | LIWC | 64 | Human-designed LIWC frequency features. |
Baseline | LIWC_Tfidf | 64 | Human-designed LIWC tf-idf features. |
Baseline | BoW_Ngrams | 20,000 | Uni-grams and bi-grams tf-idf features. |
Baseline | BoW_Stemmed | 20,000 | Stemmed uni-grams and bi-grams tf-idf features. |
Proposed | CA_Freq | 53,966 | Community activity frequency features. |
Proposed | CA_Freq_100 | 100 | 100 k-best community activity frequency features. |
Proposed | CA_Wgt_100 | 100 | 100 k-best community activity weighted features. |
Proposed | HF | 20,100 | Hybrid tf-idf features. |
Proposed | HF_LIWC | 20,164 | Hybrid tf-idf features with LIWC tf-idf features. |
Proposed | HF_10k | 10,000 | Top 10k hybrid tf-idf features. |
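Private Attribute | #Users (Total) | Class | #Users (Class) |
---|---|---|---|
Gender Identity (Gen.) | 17,589 | Male | 8797 * |
Gender Identity (Gen.) | 17,589 | Female | 8792 |
Age Group (Age) | 4136 | Young Adult | 1791 |
Age Group (Age) | 4136 | Teenager | 1790 * |
Age Group (Age) | 4136 | Younger Middle-Aged | 501 |
Age Group (Age) | 4136 | Older Middle-Aged | 54 |
Residential Area (Res.) | 17,446 | North American | 4967 |
Residential Area (Res.) | 17,446 | European | 4965 * |
Residential Area (Res.) | 17,446 | South American | 2701 |
Residential Area (Res.) | 17,446 | South Asian | 1770 |
Residential Area (Res.) | 17,446 | Southeast Asian | 1738 |
Residential Area (Res.) | 17,446 | East Asian | 799 |
Residential Area (Res.) | 17,446 | Middle Eastern | 477 |
Residential Area (Res.) | 17,446 | African | 29 |
Education Level (Edu.) | 3499 | High School | 1787 |
Education Level (Edu.) | 3499 | Graduate | 1046 |
Education Level (Edu.) | 3499 | Undergraduate | 666 |
Political Affiliation (Pol.) | 810 | Conservative | 475 |
Political Affiliation (Pol.) | 810 | Liberal | 335 |
Religious Belief (Rel.) | 2709 | Atheist | 1730 |
Religious Belief (Rel.) | 2709 | Christian | 857 |
Religious Belief (Rel.) | 2709 | Muslim | 50 |
Religious Belief (Rel.) | 2709 | Jewish | 36 |
Religious Belief (Rel.) | 2709 | Buddhist | 20 |
Religious Belief (Rel.) | 2709 | Hindu | 16 |
Personality Type (Per.) | 4723 | INTP | 1196 |
Personality Type (Per.) | 4723 | INTJ | 1078 |
Personality Type (Per.) | 4723 | ENFP | 529 |
Personality Type (Per.) | 4723 | INFJ | 504 |
Personality Type (Per.) | 4723 | ENTP | 374 |
Personality Type (Per.) | 4723 | INFP | 329 |
Personality Type (Per.) | 4723 | ISTP | 259 |
Personality Type (Per.) | 4723 | ISTJ | 94 |
Personality Type (Per.) | 4723 | ENTJ | 87 |
Personality Type (Per.) | 4723 | ESTP | 58 |
Personality Type (Per.) | 4723 | ISFJ | 52 |
Personality Type (Per.) | 4723 | ISFP | 49 |
Personality Type (Per.) | 4723 | ENFJ | 47 |
Personality Type (Per.) | 4723 | ESFP | 41 |
Personality Type (Per.) | 4723 | ESFJ | 13 |
Personality Type (Per.) | 4723 | ESTJ | 13 |
Introversion/Extraversion (I/E) | 4723 | Introversion | 3561 |
Introversion/Extraversion (I/E) | 4723 | Extraversion | 1162 |
Sensing/Intuition (S/N) | 4723 | Intuition | 4144 |
Sensing/Intuition (S/N) | 4723 | Sensing | 579 |
Thinking/Feeling (T/F) | 4723 | Thinking | 3159 |
Thinking/Feeling (T/F) | 4723 | Feeling | 1564 |
Judging/Perception (J/P) | 4723 | Perception | 2835 |
Judging/Perception (J/P) | 4723 | Judging | 1888 |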
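The four binary personality datasets follow directly from the 16 personality types, since each letter of a type code labels one dimension. The sketch below derives the per-dimension class sizes from the per-type user counts in the table; the variable and function names are illustrative.

```python
from collections import Counter

TYPE_COUNTS = {
    "INTP": 1196, "INTJ": 1078, "ENFP": 529, "INFJ": 504,
    "ENTP": 374, "INFP": 329, "ISTP": 259, "ISTJ": 94,
    "ENTJ": 87, "ESTP": 58, "ISFJ": 52, "ISFP": 49,
    "ENFJ": 47, "ESFP": 41, "ESFJ": 13, "ESTJ": 13,
}
DIMENSIONS = ("I/E", "S/N", "T/F", "J/P")


def split_type(mbti_type):
    """Map a four-letter type such as 'INTP' to one letter per dimension."""
    return dict(zip(DIMENSIONS, mbti_type))


dimension_counts = {name: Counter() for name in DIMENSIONS}
for mbti_type, n_users in TYPE_COUNTS.items():
    for name, letter in split_type(mbti_type).items():
        dimension_counts[name][letter] += n_users

total_users = sum(TYPE_COUNTS.values())
```

Summing by letter reproduces the class sizes listed for the four dimensions (for example, 3561 introverted versus 1162 extraverted users out of 4723), so each binary task is far less fragmented than the 16-class task.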
Group | Feature Set | Gen. | Age | Res. | Edu. | Pol. | Rel. | Per. | E/I | S/N | T/F | J/P |
---|---|---|---|---|---|---|---|---|---|---|---|---|
Baseline | None (MCC) | 0.333 | 0.151 | 0.055 | 0.226 | 0.370 | 0.130 | 0.025 | 0.430 | 0.467 | 0.401 | 0.375 |
Baseline | LIWC | 0.757 | 0.361 | 0.293 | 0.546 | 0.608 | 0.255 | 0.078 | 0.533 | 0.486 | 0.602 | 0.551 |
Baseline | LIWC_Tfidf | 0.781 | 0.362 | 0.336 | 0.553 | 0.592 | 0.231 | 0.064 | 0.437 | 0.467 | 0.578 | 0.496 |
Baseline | BoW_Ngrams | 0.895 | 0.480 | 0.702 | 0.791 | 0.728 | 0.447 | 0.222 | 0.595 | 0.529 | 0.702 | 0.624 |
Baseline | BoW_Stemmed | 0.895 | 0.478 | 0.707 | 0.791 | 0.729 | 0.467 | 0.231 | 0.591 | 0.545 | 0.706 | 0.636 |
Proposed | CA_Freq | 0.892 | 0.508 | 0.868 | 0.956 | 0.896 | 0.490 | 0.545 | 0.810 | 0.768 | 0.835 | 0.863 |
Proposed | CA_Freq_100 | 0.840 | 0.477 | 0.901 | 0.981 | 0.907 | 0.495 | 0.644 | 0.871 | 0.859 | 0.863 | 0.886 |
Proposed | CA_Wgt_100 | 0.880 | 0.498 | 0.947 | 0.979 | 0.915 | 0.606 | 0.562 | 0.868 | 0.836 | 0.871 | 0.878 |
Proposed | HF | 0.921 | 0.520 | 0.854 | 0.907 | 0.877 | 0.531 | 0.511 | 0.760 | 0.721 | 0.808 | 0.801 |
Proposed | HF_LIWC | 0.920 | 0.517 | 0.855 | 0.907 | 0.873 | 0.557 | 0.511 | 0.761 | 0.722 | 0.808 | 0.801 |
Proposed | HF_10k | 0.921 | 0.520 | 0.856 | 0.907 | 0.877 | 0.562 | 0.515 | 0.759 | 0.722 | 0.809 | 0.816 |
Attribute | Our Methods | Gjurković and Šnajder [12] |
---|---|---|
Personality Type | 64.4% (NB) | 41.7% (MLP) |
Introversion/Extraversion | 87.1% (MLP) | 82.8% (MLP) |
Sensing/Intuition | 85.9% (RF) | 79.2% (MLP) |
Thinking/Feeling | 87.1% (RF) | 67.2% (LR) |
Judging/Perception | 88.6% (MLP) | 74.8% (LR) |
Group | Feature Set | Gen. | Age | Res. | Edu. | Pol. | Rel. | Per. | E/I | S/N | T/F | J/P |
---|---|---|---|---|---|---|---|---|---|---|---|---|
Baseline | LIWC | MLP | RF | NB | RF | NB | NB | NB | NB | SVM | NB | NB |
Baseline | LIWC_Tfidf | MLP | RF | MLP | RF | RF | RF | RF | RF | MLP | RF | RF |
Baseline | BoW_Ngrams | MLP | MLP | MLP | MLP | MLP | MLP | SVM | MLP | MLP | SVM | MLP |
Baseline | BoW_Stemmed | MLP | MLP | MLP | MLP | MLP | MLP | SVM | MLP | MLP | SVM | SVM |
Proposed | CA_Freq | RF | NB | RF | RF | RF | SVM | RF | RF | RF | RF | RF |
Proposed | CA_Freq_100 | MLP | NB | RF | RF | RF | NB | NB | MLP | RF | RF | MLP |
Proposed | CA_Wgt_100 | MLP | RF | RF | RF | RF | RF | RF | RF | RF | RF | RF |
Proposed | HF | MLP | MLP | SVM | SVM | SVM | MLP | SVM | SVM | SVM | SVM | SVM |
Proposed | HF_LIWC | MLP | MLP | SVM | SVM | SVM | MLP | SVM | SVM | SVM | SVM | SVM |
Proposed | HF_10k | MLP | MLP | SVM | SVM | SVM | MLP | SVM | MLP | SVM | SVM | RF |
Group | Feature Set | Gen. | Age | Res. | Edu. | Pol. | Rel. | Per. | E/I | S/N | T/F | J/P |
---|---|---|---|---|---|---|---|---|---|---|---|---|
Baseline | LIWC | 539 | 83 | 527 | 73 | 47 | 138 | 121 | 101 | 105 | 112 | 115 |
Baseline | LIWC_Tfidf | 561 | 86 | 529 | 75 | 47 | 137 | 120 | 104 | 108 | 117 | 119 |
Baseline | BoW_Ngrams | 667 | 120 | 788 | 115 | 61 | 158 | 150 | 121 | 130 | 153 | 135 |
Baseline | BoW_Stemmed | 2426 | 366 | 2461 | 345 | 239 | 667 | 510 | 462 | 479 | 519 | 495 |
Proposed | CA_Freq | 228 | 61 | 388 | 70 | 16 | 43 | 84 | 92 | 63 | 73 | 69 |
Proposed | CA_Freq_100 | 49 | 8 | 39 | 8 | 2 | 5 | 10 | 9 | 9 | 10 | 10 |
Proposed | CA_Wgt_100 | 103 | 17 | 83 | 15 | 3 | 10 | 22 | 18 | 18 | 20 | 21 |
Proposed | HF | 1004 | 154 | 879 | 147 | 83 | 223 | 213 | 170 | 175 | 206 | 194 |
Proposed | HF_LIWC | 1788 | 280 | 1618 | 249 | 146 | 421 | 322 | 321 | 333 | 371 | 362 |
Proposed | HF_10k | 996 | 140 | 834 | 128 | 72 | 208 | 153 | 156 | 165 | 185 | 180 |
Group | Feature Set | Gen. | Age | Res. | Edu. | Pol. | Rel. | Per. | E/I | S/N | T/F | J/P |
---|---|---|---|---|---|---|---|---|---|---|---|---|
Baseline | LIWC | 0.245 | 0.638 | 0.728 | 0.454 | 0.454 | 0.776 | 0.938 | 0.554 | 0.532 | 0.434 | 0.490 |
Baseline | LIWC_Tfidf | 0.236 | 0.638 | 0.696 | 0.447 | 0.408 | 0.769 | 0.935 | 0.563 | 0.532 | 0.422 | 0.503 |
Baseline | BoW_Ngrams | 0.164 | 0.615 | 0.493 | 0.306 | 0.388 | 0.705 | 0.800 | 0.475 | 0.489 | 0.388 | 0.412 |
Baseline | BoW_Stemmed | 0.164 | 0.616 | 0.490 | 0.311 | 0.376 | 0.723 | 0.808 | 0.465 | 0.482 | 0.388 | 0.414 |
Proposed | CA_Freq | 0.136 | 0.504 | 0.171 | 0.141 | 0.131 | 0.547 | 0.552 | 0.285 | 0.293 | 0.256 | 0.262 |
Proposed | CA_Freq_100 | 0.080 | 0.179 | 0.048 | 0.032 | 0.060 | 0.302 | 0.218 | 0.071 | 0.077 | 0.086 | 0.063 |
Proposed | CA_Wgt_100 | 0.107 | 0.459 | 0.050 | 0.038 | 0.070 | 0.276 | 0.403 | 0.122 | 0.149 | 0.116 | 0.101 |
Proposed | HF | 0.127 | 0.585 | 0.287 | 0.174 | 0.286 | 0.663 | 0.667 | 0.382 | 0.398 | 0.294 | 0.255 |
Proposed | HF_LIWC | 0.133 | 0.586 | 0.301 | 0.195 | 0.269 | 0.688 | 0.706 | 0.404 | 0.362 | 0.325 | 0.254 |
Proposed | HF_10k | 0.118 | 0.574 | 0.239 | 0.131 | 0.252 | 0.659 | 0.584 | 0.307 | 0.320 | 0.268 | 0.232 |
Private Attribute Dataset | Classes | Instances | Imbalance Ratio |
---|---|---|---|
Gender Identity | 2 | 17,589 | 1.00 |
Age Group | 4 | 4136 | 33.17 |
Education Level | 3 | 3499 | 2.68 |
Residential Area | 8 | 17,446 | 171.28 |
Political Affiliation | 2 | 810 | 1.42 |
Religious Belief | 6 | 2709 | 108.12 |
Personality Type | 16 | 4723 | 92.00 |
Introversion/Extraversion | 2 | 4723 | 3.06 |
Sensing/Intuition | 2 | 4723 | 7.16 |
Thinking/Feeling | 2 | 4723 | 2.02 |
Judging/Perception | 2 | 4723 | 1.50 |
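The imbalance ratios in this table appear to equal the size of the largest class divided by the size of the smallest class. This definition is inferred from the class sizes in the dataset statistics rather than stated in this section, so the sketch below should be read as a consistency check under that assumption.

```python
def imbalance_ratio(class_counts):
    """Largest class size divided by smallest class size."""
    sizes = list(class_counts.values())
    return max(sizes) / min(sizes)


age = {
    "Young Adult": 1791,
    "Teenager": 1790,
    "Younger Middle-Aged": 501,
    "Older Middle-Aged": 54,
}
political = {"Conservative": 475, "Liberal": 335}

age_ratio = f"{imbalance_ratio(age):.2f}"              # "33.17"
political_ratio = f"{imbalance_ratio(political):.2f}"  # "1.42"
```

A ratio near 1 (gender identity) means the classes are balanced, while ratios above 100 (residential area, religious belief) mean the smallest class has only a few dozen users.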
Feature Set | Technique | Gen. | Age | Res. | Edu. | Pol. | Rel. | Per. | E/I | S/N | T/F | J/P |
---|---|---|---|---|---|---|---|---|---|---|---|---|
CA_Freq | None | 0.892 | 0.508 | 0.868 | 0.956 | 0.896 | 0.490 | 0.545 | 0.810 | 0.768 | 0.835 | 0.863 |
CA_Freq | RO | 0.892 | 0.520 | 0.903 | 0.969 | 0.917 | 0.454 | 0.521 | 0.809 | 0.773 | 0.848 | 0.865 |
CA_Freq | SMOTE | 0.891 | 0.527 | 0.896 | 0.965 | 0.905 | 0.481 | 0.530 | 0.792 | 0.751 | 0.835 | 0.857 |
CA_Wgt_100 | None | 0.880 | 0.498 | 0.947 | 0.979 | 0.915 | 0.606 | 0.562 | 0.868 | 0.836 | 0.871 | 0.878 |
CA_Wgt_100 | RO | 0.868 | 0.513 | 0.955 | 0.984 | 0.917 | 0.598 | 0.560 | 0.861 | 0.815 | 0.871 | 0.871 |
CA_Wgt_100 | SMOTE | 0.867 | 0.508 | 0.959 | 0.983 | 0.919 | 0.561 | 0.518 | 0.851 | 0.786 | 0.867 | 0.867 |
HF | None | 0.921 | 0.520 | 0.854 | 0.907 | 0.877 | 0.531 | 0.511 | 0.760 | 0.721 | 0.808 | 0.801 |
HF | RO | 0.918 | 0.563 | 0.907 | 0.916 | 0.880 | 0.691 | 0.558 | 0.816 | 0.775 | 0.825 | 0.815 |
HF | SMOTE | 0.918 | 0.551 | 0.906 | 0.919 | 0.884 | 0.685 | 0.542 | 0.824 | 0.778 | 0.829 | 0.815 |
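Assuming "RO" in the table denotes random oversampling, the technique can be sketched as below: minority-class samples are duplicated at random until every class matches the size of the largest one. The function name, the seed parameter, and the toy data are illustrative and not taken from the authors' code. SMOTE [16] differs in that it synthesizes new minority samples by interpolating between neighbouring ones instead of duplicating them, and it is not implemented here.

```python
import random
from collections import Counter, defaultdict


def random_oversample(samples, labels, seed=0):
    """Duplicate randomly chosen minority samples until all classes are equal in size."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for sample, label in zip(samples, labels):
        by_class[label].append(sample)
    target = max(len(group) for group in by_class.values())
    out_samples, out_labels = [], []
    for label, group in by_class.items():
        extra = [rng.choice(group) for _ in range(target - len(group))]
        for sample in group + extra:
            out_samples.append(sample)
            out_labels.append(label)
    return out_samples, out_labels


# Toy data (hypothetical): four majority samples, one minority sample.
samples = ["a1", "a2", "a3", "a4", "b1"]
labels = ["A", "A", "A", "A", "B"]
balanced_samples, balanced_labels = random_oversample(samples, labels)
class_sizes = Counter(balanced_labels)
```

Resampling of this kind is normally applied to the training split only, so that duplicated samples do not leak into the test data.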
Feature Set | F1 Score |
---|---|
MCC | 0.025 |
LIWC_Tfidf + RF | 0.064 |
Demographic attributes + LR | 0.048 |
CA_Freq_100 + NB | 0.644 |
© 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Tuomchomtam, S.; Soonthornphisaj, N. Demographics and Personality Discovery on Social Media: A Machine Learning Approach. Information 2021, 12, 353. https://doi.org/10.3390/info12090353