A Machine Learning-Based Method for Content Verification in the E-Commerce Domain
Abstract
1. Introduction
1.1. Overview
1.2. Related Works
2. Materials and Methods
2.1. Dataset Exploration
The total number of cross-comparisons between the dataset instances follows the standard combination formula, C = n! / (r! (n − r)!), where:
- C is the total number of calculations;
- n is the number of instances contained in the dataset;
- r is the number of instances compared in each calculation.
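The comparison count can be computed directly with the standard library; a minimal sketch, with the record count chosen purely for illustration:

```python
from math import comb

def total_comparisons(n: int, r: int = 2) -> int:
    """Number of distinct r-element combinations of n records,
    i.e. C = n! / (r! * (n - r)!)."""
    return comb(n, r)

# For example, cross-comparing 1000 records in pairs (r = 2):
print(total_comparisons(1000))  # 499500
```

For pairwise comparison (r = 2) this reduces to n(n − 1)/2, which grows quadratically with the dataset size and motivates the blocking and normalization steps that follow.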
- Name normalization: one of the most common and difficult issues when consolidating data from heterogeneous, multivariate sources is that the consumed data instances carry differing features. Text normalization transforms the original raw text into a canonical form that differs from the initial one. Multiple methods can be applied to the raw data, including handling Unicode quirks, upper-to-lower-case conversion, irrelevant symbol removal, whitespace collapsing and normalization, and conversion of each distinct word to a double metaphone. Additionally, this step removes special characters (umlauts, rare letters, accents and/or other non-typical Unicode artifacts) and stop characters (e.g., string punctuation), using the appropriate libraries.
- Greedy matching of the pairs: the next step involves discarding multiple matches, removing duplicates and, eventually, concatenating the remaining pairs of persons.
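The normalization steps above can be sketched with the standard library alone; the double-metaphone encoding would require a third-party package (e.g., the `metaphone` PyPI package) and is indicated only as a comment, so this is a partial illustration rather than the authors' pipeline:

```python
import re
import unicodedata

def normalize_name(raw: str) -> str:
    """Canonicalize a raw name string: strip accents/umlauts,
    lower-case, drop punctuation, collapse whitespace."""
    # Decompose accented characters, then drop the combining marks
    text = unicodedata.normalize("NFKD", raw)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = text.lower()
    # Replace anything that is not a letter, digit or whitespace
    text = re.sub(r"[^\w\s]", " ", text)
    # Collapse runs of whitespace into single spaces
    text = re.sub(r"\s+", " ", text).strip()
    # A double-metaphone encoding step would follow here,
    # e.g. via the third-party `metaphone` package (not shown).
    return text

print(normalize_name("  Müller,  José-Luis "))  # "muller jose luis"
```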
The Jaro similarity of two strings s1 and s2 is 0 when m = 0 and otherwise equals (1/3) × (m/|s1| + m/|s2| + (m − t)/m), where:
- m is the number of matching characters. Characters whose distance is not greater than the result of Equation (3), i.e., ⌊max(|s1|, |s2|)/2⌋ − 1, are considered matching characters;
- t is half the number of transpositions. A transposition is considered to occur when two characters are the same but not in the same place in the two strings examined; thus, t is half the number of matched but misplaced characters;
- |s1| and |s2| are the lengths of strings s1 and s2, respectively.
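A straightforward implementation of the Jaro similarity built from the quantities defined above — a sketch for illustration, not the authors' exact code:

```python
def jaro_similarity(s1: str, s2: str) -> float:
    """Jaro similarity combining the matching characters m, the
    transposition count t and the string lengths |s1| and |s2|."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters match only if equal and no further apart than this window
    window = max(len1, len2) // 2 - 1
    match1 = [False] * len1
    match2 = [False] * len2
    m = 0
    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len2, i + window + 1)
        for j in range(lo, hi):
            if not match2[j] and s2[j] == ch:
                match1[i] = match2[j] = True
                m += 1
                break
    if m == 0:
        return 0.0
    # t is half the number of matched characters that are out of order
    out_of_order = 0
    k = 0
    for i in range(len1):
        if match1[i]:
            while not match2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    t = out_of_order / 2
    return (m / len1 + m / len2 + (m - t) / m) / 3
```

On the classic example "MARTHA" vs. "MARHTA", this yields m = 6 and t = 1, giving a similarity of (1 + 1 + 5/6)/3 ≈ 0.944.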
2.2. Method Followed
- Firstly, a conventional ML algorithm was selected, namely, the logistic regression algorithm, in order to train the model using the original, imbalanced dataset.
- Subsequently, an up-sampling method was applied to the original, imbalanced dataset. Up-sampling techniques randomly duplicate records of the minority label (class) in order to improve the model’s extracted metrics and its overall performance compared with the original, imbalanced dataset [25,26,27]. Applying this type of method resamples the initial records, setting the final number of minority-class samples (which correspond to duplicate entries) equal to the number of majority-class instances in the original, imbalanced dataset.
- This generated, up-sampled dataset was then fed to the (initially selected) logistic regression algorithm, in order to compare the extracted performance metrics with those obtained from the initial LR implementation on the original, imbalanced dataset.
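The random up-sampling step described above can be illustrated with a minimal, pure-Python sketch; the function name and the toy records are illustrative assumptions, not the authors' pipeline:

```python
import random

def upsample_minority(records, label_key="label", seed=1001):
    """Randomly duplicate minority-class records until every
    class has as many samples as the majority class."""
    rng = random.Random(seed)
    by_class = {}
    for rec in records:
        by_class.setdefault(rec[label_key], []).append(rec)
    majority_size = max(len(recs) for recs in by_class.values())
    balanced = list(records)
    for recs in by_class.values():
        deficit = majority_size - len(recs)
        if deficit > 0:
            # Sample with replacement from the under-represented class
            balanced.extend(rng.choices(recs, k=deficit))
    return balanced

# Hypothetical toy dataset: 4 non-duplicate pairs (0) vs. 1 duplicate pair (1)
data = [{"label": 0}] * 4 + [{"label": 1}]
balanced = upsample_minority(data)
print(sum(r["label"] == 1 for r in balanced))  # 4
```

After up-sampling, both classes contain the same number of samples, matching the balancing behavior described in the text.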
3. Results
- TP stands for true positives, i.e., a record labeled as similar (label = 1) that, indeed, concerns two similar person instances;
- TN stands for true negatives, i.e., a record labeled as non-similar (label = 0) that, indeed, concerns two different persons;
- FP stands for false positives, i.e., a record labeled as similar (label = 1) that actually concerns two different persons;
- FN stands for false negatives, i.e., a record labeled as non-similar (label = 0) that actually concerns two similar person instances.
- y is the sample label, y ∈ {0,1};
- p is the predicted probability of the sample belonging to the positive class, i.e., p = Pr(y = 1).
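The quantities above combine into the reported evaluation metrics and the log loss (binary cross-entropy); a minimal sketch, with the confusion-matrix counts chosen purely for illustration:

```python
from math import log

def classification_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall and F1-score from
    confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return accuracy, precision, recall, f1

def log_loss(y_true, y_prob, eps=1e-15):
    """Binary cross-entropy: -(1/N) * sum(y*log(p) + (1-y)*log(1-p))."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)  # clamp to avoid log(0)
        total += y * log(p) + (1 - y) * log(1 - p)
    return -total / len(y_true)

# Illustrative counts: 8 TP, 90 TN, 1 FP, 1 FN
print(classification_metrics(tp=8, tn=90, fp=1, fn=1))
```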
3.1. Imbalanced Dataset Results
3.2. Up-Sampled Dataset Results
3.3. Feature Importance Results
4. Discussion
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- CISCO. The Zettabyte Era: Trends and Analysis; The Cisco Visual Networking Index (Cisco VNI); CISCO: San Jose, CA, USA, 2016. [Google Scholar]
- The World Bank. Crossing Borders; World Development Report; The World Bank: Washington, DC, USA, 2021. [Google Scholar]
- Leonelli, S. Scientific Research and Big Data. In The Stanford Encyclopedia of Philosophy; Zalta, E.N., Ed.; Metaphysics Research Lab., Stanford University: Stanford, CA, USA, 2020. [Google Scholar]
- Zhu, G.; Zhang, X.; Wang, L.; Zhu, Y.; Dong, X. An Intelligent Data De-Duplication Based Backup System. In Proceedings of the 2012 15th International Conference on Network-Based Information Systems, Melbourne, Australia, 26–28 September 2012; pp. 771–776. [Google Scholar]
- Hall, D.L.; Llinas, J. An Introduction to Multisensor Data Fusion. Proc. IEEE 1997, 85, 6–23. [Google Scholar] [CrossRef] [Green Version]
- Akter, S.; Wamba, S.F. Big Data Analytics in E-Commerce: A Systematic Review and Agenda for Future Research. Electron. Mark. 2016, 26, 173–194. [Google Scholar] [CrossRef] [Green Version]
- Tran, M.-Q.; Elsisi, M.; Mahmoud, K.; Liu, M.-K.; Lehtonen, M.; Darwish, M.M.F. Experimental Setup for Online Fault Diagnosis of Induction Machines via Promising IoT and Machine Learning: Towards Industry 4.0 Empowerment. IEEE Access 2021, 9, 115429–115441. [Google Scholar] [CrossRef]
- Ćirović, G.; Pamučar, D.; Božanić, D. Green Logistic Vehicle Routing Problem: Routing Light Delivery Vehicles in Urban Areas Using a Neuro-Fuzzy Model. Expert Syst. Appl. 2014, 41, 4245–4258. [Google Scholar] [CrossRef]
- Policarpo, L.M.; da Silveira, D.E.; da Rosa Righi, R.; Stoffel, R.A.; da Costa, C.A.; Barbosa, J.L.V.; Scorsatto, R.; Arcot, T. Machine Learning through the Lens of E-Commerce Initiatives: An up-to-Date Systematic Literature Review. Comput. Sci. Rev. 2021, 41, 100414. [Google Scholar] [CrossRef]
- Carvalho, M.; Laender, A.; Gonçalves, M.; Silva, A. A Genetic Programming Approach to Record Deduplication. Knowl. Data Eng. IEEE Trans. 2012, 24, 399–412. [Google Scholar] [CrossRef]
- Christen, P.; Goiser, K. Towards Automated Data Linkage and Deduplication. Computer 2019, 16, 22–24. [Google Scholar]
- Elfeky, M.G.; Verykios, V.S.; Elmagarmid, A.K. TAILOR: A Record Linkage Toolbox. In Proceedings of the 18th International Conference on Data Engineering, San Jose, CA, USA, 26 February–1 March 2002; pp. 17–28. [Google Scholar]
- Gschwind, T.; Miksovic, C.; Minder, J.; Mirylenka, K.; Scotton, P. Fast Record Linkage for Company Entities. In Proceedings of the 2019 IEEE International Conference on Big Data (Big Data), Los Angeles, CA, USA, 9–12 December 2019; pp. 623–630. [Google Scholar]
- Rajbahadur, G.K.; Wang, S.; Ansaldi, G.; Kamei, Y.; Hassan, A.E. The Impact of Feature Importance Methods on the Interpretation of Defect Classifiers. IEEE Trans. Softw. Eng. 2021, 1. [Google Scholar] [CrossRef]
- Zhu, Z.; Ong, Y.-S.; Dash, M. Wrapper–Filter Feature Selection Algorithm Using a Memetic Framework. IEEE Trans. Syst. Man Cybern. Part B (Cybern.) 2007, 37, 70–76. [Google Scholar] [CrossRef] [PubMed]
- Tran, M.-Q.; Elsisi, M.; Liu, M.-K. Effective Feature Selection with Fuzzy Entropy and Similarity Classifier for Chatter Vibration Diagnosis. Measurement 2021, 184, 109962. [Google Scholar] [CrossRef]
- Niño-Adan, I.; Manjarres, D.; Landa-Torres, I.; Portillo, E. Feature Weighting Methods: A Review. Expert Syst. Appl. 2021, 184, 115424. [Google Scholar] [CrossRef]
- Alexakis, T.; Peppes, N.; Adamopoulou, E.; Demestichas, K.; Remoundou, K. Evaluation of Content Fusion Algorithms for Large and Heterogeneous Datasets. In Security Technologies and Social Implications: An European Perspective; Wiley-IEEE Press (pending publication): Hoboken, NJ, USA, 2022. [Google Scholar]
- Jaro, M.A. Advances in Record-Linkage Methodology as Applied to Matching the 1985 Census of Tampa, Florida. J. Am. Stat. Assoc. 1989, 84, 414–420. [Google Scholar] [CrossRef]
- Winkler, W. String Comparator Metrics and Enhanced Decision Rules in the Fellegi-Sunter Model of Record Linkage; ERIC: Middletown, OH, USA, 1990; pp. 354–359. [Google Scholar]
- Levenshtein, V.I. Binary Codes Capable of Correcting Deletions, Insertions and Reversals. Sov. Phys. Dokl. 1966, 10, 707–710. [Google Scholar]
- Gomaa, W.H.; Fahmy, A.A. A Survey of Text Similarity Approaches. Int. J. Comput. Appl. 2013, 68, 13–18. [Google Scholar]
- Jaccard, P. Distribution de La Flore Alpine Dans Le Bassin Des Dranses et Dans Quelques Régions Voisines. Bull. Soc. Vaud. Sci. Nat. 1901, 37, 241–272. [Google Scholar] [CrossRef]
- Weisstein, E.W. Combination. Available online: https://mathworld.wolfram.com/Combination.html (accessed on 9 December 2021).
- Marqués, A.I.; García, V.; Sánchez, J.S. On the Suitability of Resampling Techniques for the Class Imbalance Problem in Credit Scoring. J. Oper. Res. Soc. 2013, 64, 1060–1070. [Google Scholar] [CrossRef] [Green Version]
- More, A. Survey of Resampling Techniques for Improving Classification Performance in Unbalanced Datasets. arXiv 2016, arXiv:1608.06048. [Google Scholar]
- Peppes, N.; Daskalakis, E.; Alexakis, T.; Adamopoulou, E.; Demestichas, K. Performance of Machine Learning-Based Multi-Model Voting Ensemble Methods for Network Threat Detection in Agriculture 4.0. Sensors 2021, 21, 7475. [Google Scholar] [CrossRef]
- Islah, N.; Koerner, J.; Genov, R.; Valiante, T.A.; O’Leary, G. Machine Learning with Imbalanced EEG Datasets Using Outlier-Based Sampling. In Proceedings of the 2020 42nd Annual International Conference of the IEEE Engineering in Medicine Biology Society (EMBC), Montreal, QC, Canada, 20–24 July 2020; pp. 112–115. [Google Scholar]
- Maldonado, S.; López, J. Dealing with High-Dimensional Class-Imbalanced Datasets: Embedded Feature Selection for SVM Classification. Appl. Soft Comput. 2018, 67, 94–105. [Google Scholar] [CrossRef]
- Ganganwar, V. An Overview of Classification Algorithms for Imbalanced Datasets. Int. J. Emerg. Technol. Adv. Eng. 2012, 2, 42–47. [Google Scholar]
- Panigrahi, R.; Borah, S.; Bhoi, A.K.; Ijaz, M.F.; Pramanik, M.; Kumar, Y.; Jhaveri, R.H. A Consolidated Decision Tree-Based Intrusion Detection System for Binary and Multiclass Imbalanced Datasets. Mathematics 2021, 9, 751. [Google Scholar] [CrossRef]
- Hossin, M.; Sulaiman, M.N. A Review on Evaluation Metrics for Data Classification Evaluations. Int. J. Data Min. Knowl. Manag. Process 2015, 5, 1. [Google Scholar]
- Bishop, C.M. Pattern Recognition and Machine Learning (Information Science and Statistics); Springer: Berlin/Heidelberg, Germany, 2006; ISBN 0-387-31073-8. [Google Scholar]
- Mishra, A. Metrics to Evaluate Your Machine Learning Algorithm. Available online: https://towardsdatascience.com/metrics-to-evaluate-your-machine-learning-algorithm-f10ba6e38234 (accessed on 9 December 2021).
- Zien, A.; Krämer, N.; Sonnenburg, S.; Rätsch, G. The Feature Importance Ranking Measure. In Proceedings of the Machine Learning and Knowledge Discovery in Databases; Buntine, W., Grobelnik, M., Mladenić, D., Shawe-Taylor, J., Eds.; Springer: Berlin/Heidelberg, Germany, 2009; pp. 694–709. [Google Scholar]
- Alwosheel, A.; van Cranenburgh, S.; Chorus, C.G. Is Your Dataset Big Enough? Sample Size Requirements When Using Artificial Neural Networks for Discrete Choice Analysis. J. Choice Model. 2018, 28, 167–182. [Google Scholar] [CrossRef]
Name | Surname | Address | DoB | Sex | Label |
---|---|---|---|---|---|
0.0 | 0.0 | 0.53626 | 0.61851 | 0.63888 | 0 |
0.46666 | 0.48148 | 0.51024 | 0.53148 | 0.63888 | 0 |
1.0 | 0.96111 | 0.90648 | 0.97407 | 1.0 | 1 |
Classifier | Parameters |
---|---|
Logistic Regression (LR) | penalty = 'elasticnet', tol = 0.001, C = 1.0, class_weight = None, random_state = 1001, solver = 'saga', max_iter = 1000, verbose = 1001, n_jobs = -1, l1_ratio = 0.5 |
Random Forest (RF) | n_estimators = 500, criterion = 'entropy', max_features = 'log2', n_jobs = -1, random_state = 1002, verbose = 1 |
Penalized SVC | kernel = 'linear', class_weight = 'balanced', probability = True, penalty = 'l1', loss = 'squared_hinge', tol = 0.001, C = 2.0, multi_class = 'ovr', verbose = 1, random_state = 1002, max_iter = 10000 |
Neural Network (NN) | hidden layers = 3, model = 'sequential', input_dim = 5, activation_function = 'relu', loss = 'binary_crossentropy', optimizer = 'adam', metrics = ['accuracy'] |
Algorithm | Accuracy | Loss | Precision | Recall | F1-Score |
---|---|---|---|---|---|
LR | 0.9980 | 0.0698 | 0.5000 | 0.5000 | 0.5000 |
NN | 1.0000 | 0.0002 | 0.5000 | 0.5000 | 0.5000 |
RF | 1.0000 | 0.0000 | 1.0000 | 1.0000 | 1.0000 |
SVC | 0.9994 | 0.0209 | 0.8800 | 1.0000 | 0.9300 |
Algorithm | Accuracy | Loss | Precision | Recall | F1-Score |
---|---|---|---|---|---|
LR | 0.9992 | 0.0276 | 1.0000 | 1.0000 | 1.0000 |
NN | 1.0000 | 0.0000 | 0.3300 | 0.5000 | 0.4000 |
© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Share and Cite
Alexakis, T.; Peppes, N.; Demestichas, K.; Adamopoulou, E. A Machine Learning-Based Method for Content Verification in the E-Commerce Domain. Information 2022, 13, 116. https://doi.org/10.3390/info13030116