An Email Cyber Threat Intelligence Method Using Domain Ontology and Machine Learning
Abstract
1. Introduction
- A novel, domain-specific ontology for emails is presented that focuses only on email message metadata, including technical fields that indicate the email message’s path from the initial sender to the final recipient.
- A new privacy-preserving semantic parser was developed that uses a semantic representation of the email message’s metadata to populate the proposed ontology and create a dataset.
- It is possible to use a semantic representation of an email message’s metadata to classify encrypted email messages without knowledge of the decryption key. The email encryption standard S/MIME [21] cryptographically protects only the body of the messages, while the header fields remain in plaintext because the SMTP servers need them to deliver messages correctly.
- The proposed method was evaluated empirically using machine learning for spam classification; it achieved an accuracy of 92.28% on the test set and per-class F1 scores of up to 95.92%.
2. Related Work
- The email classification approaches proposed by other authors use natural language processing methods to analyze the content of the messages. An effective email CTI sharing framework, however, must preserve the privacy of the messages’ content.
- One of the biggest shortcomings of natural-language-processing-based methods is their dependence on the written language of the message.
- As stated in [30], to differentiate spam from authentic information, ML techniques examine patterns and attributes in the data, and ontologies formally represent domain knowledge to generate rules for detecting social spammers.
3. Materials and Methods
- C1: developing a semantic parser module that parses only the messages’ metadata, using emails from a public mail corpus;
- C2: populating the domain-specific email ontology with the data obtained from the parsing module and preparing the labeled dataset for ML;
- C3: using various ML algorithms to train the models and performing evaluations to select the best model.
- S1: to preserve privacy, the parser module parses only the metadata of the users’ emails;
- S2: populating the domain-specific ontology with the data obtained from the parsing module and passing these data to the trained ML model for prediction;
- S3: the deployed ML model evaluates the message’s metadata according to the domain-specific ontology and returns its predictions to the subsequent decision-making module.
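The metadata-only parsing step can be illustrated with a minimal sketch based on Python's standard `email` package. This is not the authors' parser: the function name `extract_metadata` and the key `RECEIVED_COUNT` are hypothetical, and only a handful of fields are extracted. The sketch assumes that the header block ends at the first empty line, as specified for Internet messages, and discards everything after it before parsing.

```python
import re
from email import policy
from email.parser import BytesParser


def extract_metadata(raw: bytes) -> dict:
    """Parse only the header block of a raw message (hypothetical helper)."""
    # Cut the message at the first empty line so the body never reaches the parser.
    header_block = re.split(rb"\r?\n\r?\n", raw, maxsplit=1)[0]
    msg = BytesParser(policy=policy.default).parsebytes(header_block, headersonly=True)

    meta = {
        "M_ID": str(msg["Message-ID"] or ""),
        "CONT_TYPE": msg.get_content_maintype(),
        "CONT_SUBTYPE": msg.get_content_subtype(),
        # Hypothetical key: one Received field is added per relaying server.
        "RECEIVED_COUNT": len(msg.get_all("Received", [])),
    }
    from_header = msg["From"]
    if from_header is not None and from_header.addresses:
        sender = from_header.addresses[0]
        meta["FROM_P"] = sender.display_name  # display-name part
        meta["FROM_U"] = sender.username      # local-part
        meta["FROM_D"] = sender.domain        # domain part
    return meta


# Illustrative message; all addresses and hosts are made up.
SAMPLE = (
    b"Received: from mail.example.org by mx.example.net; Tue, 02 Jul 2024 10:00:05 +0000\r\n"
    b"From: Alice Example <alice@example.org>\r\n"
    b"To: bob@example.net\r\n"
    b"Message-ID: <1234@example.org>\r\n"
    b"Content-Type: text/plain; charset=utf-8\r\n"
    b"\r\n"
    b"SECRET BODY\r\n"
)
metadata = extract_metadata(SAMPLE)
```

Because the body is dropped before parsing, the same function would apply unchanged to a message whose body is encrypted, as long as the header fields are in plaintext.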
3.1. Domain-Specific Email Ontology
3.2. Applying the Domain-Specific Ontology to the Collected Email Messages
Algorithm 1. The pseudo-code of the semantic parser.
Output: Email-specific ontology O appended with new individuals and relationships representing email message msg.
End the semantic parser algorithm.
- None of the privacy-sensitive parts of the message (such as the body or the content of the attachments) are included when asserting the instances and relationships of the entities that form the semantic representation of the email message. As the results presented in this study show, in most cases the semantically enriched metadata of the message are sufficient to separate unsolicited messages from legitimate ones.
- The data represented in RDF format can be readily processed using an OWL-based reasoner. On the basis of the relationships between the instances, the reasoner can infer new properties, implicit relationships with other instances, and membership of subclasses. Furthermore, the class taxonomy, the ontological relationships, and the constraints allow the reasoner to correlate instances more precisely than the asserted facts alone would permit.
- The asserted and inferred data are stored as triples in specialized stores that support ad hoc queries. Well-known query languages, such as SPARQL (SPARQL Protocol and RDF Query Language), can be used to further enrich the semantic representation of the email message’s metadata. SPARQL queries are built on “triple pattern” matching, which mirrors the subject-predicate-object structure of RDF statements and offers an efficient mechanism for matching triples.
- It is possible to use the semantic representation of email message metadata for the classification of encrypted email messages without knowledge of the decryption key. The S/MIME email encryption standard cryptographically protects only the body of the messages, whereas the header fields remain in plaintext because the SMTP servers need to deliver the message correctly. In such a case, only the metadata of the attached files and links to external resources cannot be extracted and populated. That is, classification is possible on the SMTP servers of the service provider without compromising the confidentiality of the final user’s data.
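The triple representation and triple pattern matching described above can be sketched in plain Python. This is a toy in-memory store, not an RDF library or a SPARQL engine; the helper names `message_triples`, `match`, and `messages_from_domain` are hypothetical. The property names `hasFromAddress`, `hasUser`, and `hasUserDomain` are taken from the ontology's relationship table. A pattern position set to `None` plays the role of a SPARQL variable.

```python
from typing import NamedTuple, Optional


class Triple(NamedTuple):
    subject: str
    predicate: str
    obj: str


def message_triples(msg_id: str, from_addr: str) -> list[Triple]:
    """Assert a few individuals and relationships for one message."""
    user, domain = from_addr.split("@", 1)
    return [
        Triple(msg_id, "hasFromAddress", from_addr),
        Triple(from_addr, "hasUser", user),
        Triple(from_addr, "hasUserDomain", domain),
    ]


def match(store: list[Triple], s: Optional[str] = None,
          p: Optional[str] = None, o: Optional[str] = None) -> list[Triple]:
    """Return every triple that agrees with the fixed positions of the pattern."""
    return [
        t for t in store
        if (s is None or t.subject == s)
        and (p is None or t.predicate == p)
        and (o is None or t.obj == o)
    ]


def messages_from_domain(store: list[Triple], domain: str) -> list[str]:
    """Join two patterns: ?addr hasUserDomain <domain> and ?msg hasFromAddress ?addr."""
    addresses = {t.subject for t in match(store, p="hasUserDomain", o=domain)}
    return sorted(t.subject for t in match(store, p="hasFromAddress") if t.obj in addresses)


# Illustrative data; message identifiers and addresses are made up.
store = (
    message_triples("msg1", "alice@example.org")
    + message_triples("msg2", "bob@example.net")
    + message_triples("msg3", "carol@example.org")
)
result = messages_from_domain(store, "example.org")
```

A production system would delegate both storage and matching to a triple store, which can index triples instead of scanning the full list.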
3.3. Creation of the ML Model Using the Domain-Specific Ontology
4. Experimental Settings and Results
4.1. Dataset
4.2. Experimental Results of Evaluating the Proposed Framework
5. Discussion
6. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
- Jesus, V.; Bains, B.; Chang, V. Sharing Is Caring: Hurdles and Prospects of Open, Crowd-Sourced Cyber Threat Intelligence. IEEE Trans. Eng. Manag. 2023, 71, 6854–6873.
- Mujtaba, G.; Shuib, L.; Raj, R.G.; Majeed, N.; Al-Garadi, M.A. Email Classification Research Trends: Review and Open Issues. IEEE Access 2017, 5, 9044–9064.
- Noor, U.; Anwar, Z.; Amjad, T.; Choo, K.R. A machine learning-based FinTech cyber threat attribution framework using high-level indicators of compromise. Future Gener. Comput. Syst. 2019, 96, 227–242.
- Sakellariou, G.; Fouliras, P.; Mavridis, I.; Sarigiannidis, P. A Reference Model for Cyber Threat Intelligence (CTI) Systems. Electronics 2022, 11, 1401.
- Ramsdale, A.; Shiaeles, S.; Kolokotronis, N. A Comparative Analysis of Cyber-Threat Intelligence Sources, Formats and Languages. Electronics 2020, 9, 824.
- Hitzler, P.; Krötzsch, M.; Rudolph, S. Foundations of Semantic Web Technologies; Chapman & Hall/CRC: Boca Raton, FL, USA, 2009.
- The MITRE Corporation about CAPEC. Available online: https://capec.mitre.org/index.html (accessed on 22 June 2024).
- Roy, S.; Panaousis, E.; Noakes, C.; Laszka, A.; Panda, S.; Loukas, G. SoK: The MITRE ATT&CK Framework in Research and Practice. 2023. Available online: https://arxiv.org/abs/2304.07411 (accessed on 22 June 2024).
- Al-Sada, B.; Sadighian, A.; Oligeri, G. MITRE ATT&CK: State of the Art and Way Forward. 2023. Available online: https://arxiv.org/abs/2308.14016 (accessed on 22 June 2024).
- OASIS Open. Introduction to STIX. 2019. Available online: https://oasis-open.github.io/cti-documentation/stix/intro.html (accessed on 22 June 2024).
- Jordan, B.; Varner, D. TAXII Version 2.1. OASIS Standard. 10 June 2021. Available online: https://docs.oasis-open.org/cti/taxii/v2.1/os/taxii-v2.1-os.pdf (accessed on 31 May 2024).
- Syed, Z.; Padia, A.; Finin, T.W.; Mathews, M.L.; Joshi, A. UCO: A Unified Cybersecurity Ontology. In Proceedings of the AAAI Workshop: Artificial Intelligence for Cyber Security, Phoenix, AZ, USA, 12 February 2016.
- Preuveneers, D.; Joosen, W. An Ontology-Based Cybersecurity Framework for AI-Enabled Systems and Applications. Future Internet 2024, 16, 69.
- Onwubiko, C. CoCoa: An Ontology for Cybersecurity Operations Centre Analysis Process. In Proceedings of the 2018 International Conference On Cyber Situational Awareness, Data Analytics And Assessment (Cyber SA), Glasgow, UK, 11–12 June 2018; pp. 1–8.
- Mozzaquatro, B.A.; Agostinho, C.; Goncalves, D.; Martins, J.; Jardim-Goncalves, R. An Ontology-Based Cybersecurity Framework for the Internet of Things. Sensors 2018, 18, 3053.
- Huang, C.-C.; Huang, P.-Y.; Kuo, Y.-R.; Wong, G.-W.; Huang, Y.-T.; Sun, Y.S.; Chang Chen, M. Building Cybersecurity Ontology for Understanding and Reasoning Adversary Tactics and Techniques. In Proceedings of the 2022 IEEE International Conference on Big Data (Big Data), Osaka, Japan, 17–20 December 2022; pp. 4266–4274.
- Saidani, N.; Adi, K.; Allili, M.S. A semantic-based classification approach for an enhanced spam detection. Comput. Secur. 2020, 94, 101716.
- Jeeva, L.; Khan, I.S. A Review Article On Enhancing Email Spam Filter’s Accuracy Using Machine Learning. Int. J. Innov. Res. Comput. Sci. Technol. 2023, 11, 5–11.
- Gibson, S.; Issac, B.; Zhang, L.; Jacob, S.M. Detecting Spam Email With Machine Learning Optimized With Bio-Inspired Metaheuristic Algorithms. IEEE Access 2020, 8, 187914–187932.
- Jáñez-Martino, F.; Alaiz-Rodríguez, R.; González-Castro, V.; Fidalgo, E.; Alegre, E. A review of spam email detection: Analysis of spammer strategies and the dataset shift problem. Artif. Intell. Rev. 2023, 56, 1145–1173.
- Schaad, J.; Ramsdell, B.; Turner, S. Secure/Multipurpose Internet Mail Extensions (S/MIME) Version 4.0 Message Specification. RFC 8551. 2019. Available online: https://www.rfc-editor.org/info/rfc8551 (accessed on 1 July 2024).
- Ainslie, S.; Thompson, D.; Maynard, S.; Ahmad, A. Cyber-threat intelligence for security decision-making: A review and research agenda for practice. Comput. Secur. 2023, 132, 103352.
- Sun, N.; Ding, M.; Jiang, J.; Xu, W.; Mo, X.; Tai, Y.; Zhang, J. Cyber Threat Intelligence Mining for Proactive Cybersecurity Defense: A Survey and New Perspectives. IEEE Commun. Surv. Tutor. 2023, 20, 1186–1199.
- Zavrak, S.; Yilmaz, S. Email spam detection using hierarchical attention hybrid deep learning method. Expert Syst. Appl. 2023, 233, 120977.
- Nguyen, T.; Karunanayake, N.; Wang, S.; Seneviratne, S.; Hu, P. Privacy-preserving spam filtering using homomorphic and functional encryption. Comput. Commun. 2023, 197, 230–241.
- Kiamarzpour, F.; Dianat, R.; Bahrani, M.; Sadeghzadeh, M. Improving the methods of email classification based on words ontology. arXiv 2013, arXiv:1310.5963. Available online: https://arxiv.org/ftp/arxiv/papers/1310/1310.5963.pdf (accessed on 1 July 2024).
- Wang, M.; Song, L. An Incentive Mechanism for Reporting Phishing E-Mails Based on the Tripartite Evolutionary Game. Secur. Commun. Netw. 2021, 2021, 3394325.
- Sathya, J.; Mary Harin Fernandez, F. An Optimizing Crime Detection in Social Media Platforms Using Multiagent Ontology-Based Approach. In Proceedings of the 4th International Conference on Smart Electronics and Communication (ICOSEC), Trichy, India, 20–22 September 2023.
- Omotehinwa, T.O.; Oyewola, D.O. Hyperparameter Optimization of Ensemble Models for Spam Email Detection. Appl. Sci. 2023, 13, 1971.
- Al-Hassan, M.; Abu-Salih, B.; Al Hwaitat, A. DSpamOnto: An Ontology Modelling for Domain-Specific Social Spammers in Microblogging. Big Data Cogn. Comput. 2023, 7, 109.
- Venčkauskas, A.; Toldinas, J.; Morkevičius, N.; Sanfilippo, F. Email Domain-specific Ontology and Metadata Dataset. Mendeley Data 2024.
- Resnick, P. Internet Message Format; RFC Editor. 2008. p. RFC5322. Available online: https://www.rfc-editor.org/rfc/pdfrfc/rfc5322.txt.pdf (accessed on 23 June 2024).
- Sirbu, M.A. Content-Type Header Field for Internet Messages; RFC Editor. 1988. p. RFC1049. Available online: https://www.rfc-editor.org/rfc/pdfrfc/rfc1049.txt.pdf (accessed on 23 June 2024).
- Freed, N.; Borenstein, N. Multipurpose Internet Mail Extensions (MIME) Part One: Format of Internet Message Bodies; RFC Editor. 1996. p. RFC2045. Available online: https://www.ietf.org/rfc/rfc2045.txt (accessed on 23 June 2024).
- Klensin, J. Simple Mail Transfer Protocol. RFC Editor. 2008, p. RFC5321. Available online: https://datatracker.ietf.org/doc/html/rfc5321 (accessed on 23 June 2024).
- SpamAssassin. Available online: https://github.com/stdlib-js/datasets-spam-assassin (accessed on 18 May 2024).
- Feature Selection and Feature Transformation Using Classification Learner App. Available online: https://se.mathworks.com/help/stats/feature-selection-and-feature-transformation.html#buwh5ae-1 (accessed on 18 May 2024).
- Liu, H.; Setiono, R. Chi2: Feature selection and discretization of numeric attributes. In Proceedings of the 7th IEEE International Conference on Tools with Artificial Intelligence, Herndon, VA, USA, 5–8 November 1995; pp. 388–391.
- Train Multiclass Naive Bayes Model. Available online: https://se.mathworks.com/help/stats/fitcnb.html (accessed on 22 June 2024).
Reference | Dataset | Advantages | Disadvantages |
---|---|---|---|
Zavrak et al. [24] | TREC 2007, GenSpam, SA, LS, Enron (EN) | A combination of CNN, gated recurrent units, and attention mechanisms; cross-dataset experiments | Does not preserve the privacy of email messages; only for messages in the English language |
Nguyen et al. [25] | TREC07p, CEAS08-1, ENRON | Preserves email messages’ privacy using HE and FE; predicts the label of an encrypted email | Challenges with the distribution of keys and setting up the server |
Kiamarzpour et al. [26] | SpamBase | Filters spam by using the words’ ontology | Does not preserve the privacy of email messages; a legacy dataset was used |
Wang et al. [27] | – | The approach was based on the tripartite evolutionary game model | The custom dataset of email networks was collected by the North University of China and is not publicly available |
Sathya et al. [28] | ImageNet | A framework that leverages ontology-based techniques, multiagent optimization algorithms, and semantic analysis | Used to classify images from social media platforms; utilized a pretrained CNN model and the ImageNet dataset |
Omotehinwa et al. [29] | Enron | Proposed fine-tuned spam detection models based on the random forest (RF) and extreme gradient boost (XGBoost) algorithms | Does not preserve the privacy of email messages |
Al-Hassan et al. [30] | MIB dataset | Proposed a domain-specific ontology for detecting social spammers on microblogging platforms that target a certain domain | Detects social spammers on microblogging platforms only |
Subject | Forward Relationship | Reverse Relationship | Object |
---|---|---|---|
Message | hasToAddress, hasFromAddress, hasSenderAddress, hasReplyToAddress | isToAddress, isFromAddress, isSenderAddress, isReplyToAddress | EmailAddress |
Message | hasSentTime | | DateTime |
EmailAddress | hasUserDomain | isUserDomainOf | Domain |
EmailAddress | hasUser | isUserOf | EmailUser |
Message | wasRelayedBy | indicatesRelayFor | MTALine |
MTALine | hasRole | | Role |
MTALine | hasDateTime | | DateTime |
MTALine | hasByHost, hasFromHost | isByHost, isFromHost | Host |
MTALine | hasByDomain, hasFromDomain | isByDomain, isFromDomain | Domain |
Message | hasURL | isURLIn | URL |
URL | hasHost | isHostOf | HostAddress |
URL | hasResource | isResourceOf | URLResource |
Message | hasAttachment | isAttachmentIn | Attachment |
Attachment | hasFileName | isFileNameOf | AttachedFileName |
Attachment | hasFileType | isFileTypeOf | AttachedFileType |
Feature | Description | Feature | Description |
---|---|---|---|
M_ID | Globally unique message identifier assigned by the originator MTA | OMTA_BY_H | Host name extracted from the By-domain part of the Received field at the originator SMTP server |
SUBJECT | Human-visible subject of the message | OMTA_BY_D | Host domain extracted from the By-domain part of the Received field at the originator SMTP server |
SENT_TS | Sent timestamp assigned by the originator server | OMTA_TS | Timestamp at the originator SMTP server |
CONT_TYPE | Content part of the Content-Type header field | OMTA_DELAY | Delay of the message (in ms) at the originator SMTP server |
CONT_SUBTYPE | Subtype part of the Content-Type header field | OMTA_TT | Total travel time of the message (in ms) |
CONT_PARAM | Parameter part of the Content-Type header field | DMTA_NR | Delivery SMTP server’s hop number |
ENC | Encoding of the email body | DMTA_FROM_H | Host name extracted from the From-domain part of the Received field at the delivery SMTP server |
USER_AGENT | User-Agent header field provided by the sender | DMTA_FROM_D | Host domain extracted from the From-domain part of the Received field at the delivery SMTP server |
FROM_P | Display-name part of the From email address | DMTA_BY_H | Host name extracted from the By-domain part of the Received field at the delivery SMTP server |
FROM_U | Local-part of the From email address | DMTA_BY_D | Host domain extracted from the By-domain part of the Received field at the delivery SMTP server |
FROM_D | Domain part of the From email address | DMTA_TS | Timestamp at the delivery SMTP server |
TO_P | Display-name part of the first To email address | DMTA_DELAY | Delay of the message (in ms) at the delivery SMTP server |
TO_U | Local-part of the first To email address | DMTA_TT | Total travel time of the message (in ms) |
TO_D | Domain part of the first To email address | ATT_COUNT | Total count of attachments |
CC_P | Display-name of the first CC email address | ATT_FIRSTNAME | Name of the first attached file |
CC_U | Local-part of the first CC email address | ATT_FIRSTEXT | Extension of the first attached file |
CC_D | Domain part of the first CC email address | URL_CNT | Total count of unique URLs in the message’s body |
REPLY_P | Display-name part of the Reply-to email address | URL1_PROT | Protocol of the first URL in the message’s body |
REPLY_U | Local-part of the Reply-to email address | URL1_HOST | Host of the first URL in the message’s body |
REPLY_D | Domain part of the Reply-to email address | URL1_FILE | Resource name of the first URL in the message’s body |
SENDER_P | Display-name part of the Sender email address | URL2_PROT | Protocol of the second URL in the message’s body |
SENDER_U | Local-part of the Sender email address | URL2_HOST | Host of the second URL in the message’s body |
SENDER_D | Domain part of the Sender email address | URL2_FILE | Resource name of the second URL in the message’s body |
OMTA_NR | Originator SMTP server’s hop number (usually 1) | URL3_PROT | Protocol of the third URL in the message’s body |
OMTA_FROM_H | Host name extracted from the From-domain part of the Received [35] field at the originator SMTP server | URL3_HOST | Host of the third URL in the message’s body |
OMTA_FROM_D | Host domain extracted from the From-domain part of the Received field at the originator SMTP server | URL3_FILE | Resource name of the third URL in the message’s body |
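The timing features listed above (timestamps, per-server delays, and total travel time in ms) can be derived from the Received fields. The following sketch is illustrative only: the source does not define how the delays are computed, so it assumes that a hop's delay is the difference between consecutive Received timestamps and that the total travel time is the difference between the last and the first one. The function names are hypothetical. It relies on the fact that each server prepends its Received field, so the fields appear newest first.

```python
from datetime import datetime, timedelta
from email.utils import parsedate_to_datetime


def hop_timestamps(received_fields: list[str]) -> list[datetime]:
    """Return the Received timestamps in chronological order."""
    # The date-time follows the last semicolon of each Received field.
    stamps = [parsedate_to_datetime(f.rsplit(";", 1)[1].strip()) for f in received_fields]
    stamps.reverse()  # fields are prepended, so the oldest hop is last in the header
    return stamps


def hop_delays_ms(stamps: list[datetime]) -> list[int]:
    """Assumed definition: delay between each pair of consecutive hops, in ms."""
    return [(b - a) // timedelta(milliseconds=1) for a, b in zip(stamps, stamps[1:])]


def total_travel_ms(stamps: list[datetime]) -> int:
    """Assumed definition: time between the first and the last hop, in ms."""
    return (stamps[-1] - stamps[0]) // timedelta(milliseconds=1)


# Illustrative Received fields as they would appear top to bottom; hosts are made up.
RECEIVED = [
    "from relay2.example.net by mx.example.com; Tue, 02 Jul 2024 12:00:12 +0200",
    "from relay1.example.org by relay2.example.net; Tue, 02 Jul 2024 10:00:07 +0000",
    "from client.example.org by relay1.example.org; Tue, 02 Jul 2024 10:00:05 +0000",
]
stamps = hop_timestamps(RECEIVED)
delays = hop_delays_ms(stamps)
travel = total_travel_ms(stamps)
```

Time zone offsets are handled by the timezone-aware datetimes, so the third hop at 12:00:12 +0200 is correctly placed 5 s after the second hop.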
Message Group Name | Definition | Class Label | Number of Messages in the Group | Number of Messages in the Dataset |
---|---|---|---|---|
easy-ham-1 | Easily detected non-spam emails | easy-ham | 2500 | 2500 |
easy-ham-2 | Easily detected non-spam emails collected later | easy-ham-2 | 1400 | 1397 |
hard-ham-1 | Non-spam emails that are hard to detect | hard-ham | 250 | 248 |
spam-1 | Spam emails | spam | 500 | 485 |
spam-2 | Spam emails collected later | spam-2 | 1396 | 1331 |
Total number of messages | | | 6046 | 5961 |
Class Label | Number of Messages for Learning (~90%) | Number of Messages for Testing (~10%) | Total Number of Messages of This Class in the Dataset |
---|---|---|---|
easy-ham | 2250 | 250 | 2500 |
easy-ham-2 | 1257 | 140 | 1397 |
hard-ham | 224 | 24 | 248 |
spam | 436 | 49 | 485 |
spam-2 | 1198 | 133 | 1331 |
Total number of records | 5365 | 596 | 5961 |
Model Type | Accuracy, % (Validation) | Accuracy, % (Test) |
---|---|---|
Ensemble | 53.6 | 58.14 |
Tree | 41.94 | 41.95 |
Efficient logistic regression | 41.94 | 41.95 |
Efficient linear SVM | 41.94 | 41.95 |
SVM | 41.94 | 41.95 |
Kernel naïve Bayes | 41.94 | 41.95 |
Model Type | Names and Values of Hyperparameters | Definition |
---|---|---|
KNB | Distribution of numeric predictors: Kernel | Specifies that at least one predictor has a kernel distribution. |
KNB | Distribution name of categorical predictors: MVMN | Some predictors are categorical and are specified to be multivariate, multinomial random variables (MVMN). |
KNB | Kernel type: Gaussian | The density of kernel smoothing calculated using the Gaussian equation [39]. |
KNB | Support: unbounded | This means that the density of support has real values only. |
KNB | Standardized data: Yes | Each kernel-distributed predictor variable is centered and scaled by the software using the matching column’s mean and standard deviation. |
Ensemble | Ensemble method: Bag; Learner: Decision tree | The class labels or response variables used to train the ensemble of bagged decision trees can be supplied as a category, character, or string array; a logical or numeric vector; or a cell array of character vectors. |
Ensemble | Maximum number of splits: 5364 | The number of messages for learning was ~90% of the total number of messages of that class in the dataset (5961). |
Ensemble | Number of learners: 30 | The default learner was used, with 30 learners. |
Ensemble | Number of predictors to sample: Select All | The default (randomly chosen) number of predictive variables for each decision split, given as all. |
Tree | Maximum number of splits: 100 | To decrease the computation time and model complexity, trees with at most 100 splits were chosen. |
Tree | Split criterion: Gini’s diversity index | Gini’s diversity index was the default. |
Tree | Surrogate decision splits: Off | The dataset has no data with missing values; thus, no surrogate decision splits were used. |

Feature-Ranking Algorithm | Type of Model | Accuracy, % (Validation) | Total Misclassification Cost (Validation) | Training Time (s) | Prediction Speed (obs/s) | Model Size (MB) | Accuracy, % (Test) | Total Misclassification Cost (Test) |
---|---|---|---|---|---|---|---|---|
Chi-squared | KNB | 90.90 | 488 | 15.62 | ~2400 | ~3 | 89.09 | 65 |
Chi-squared | Ensemble | 79.94 | 1076 | 53.35 | ~12,000 | ~49 | 81.21 | 112 |
Chi-squared | Tree | 71.48 | 1530 | 8.15 | ~39,000 | ~2 | 70.63 | 175 |
ANOVA | KNB | 93.30 | 359 | 15.6 | ~2700 | ~4 | 91.44 | 51 |
ANOVA | Ensemble | 59.45 | 2175 | 66.15 | ~9900 | ~66 | 59.56 | 241 |
ANOVA | Tree | 58.45 | 2229 | 10.87 | ~2500 | ~2 | 59.56 | 241 |
Kruskal–Wallis | KNB | 88.53 | 615 | 28.71 | ~1800 | ~4 | 91.27 | 52 |
Kruskal–Wallis | Ensemble | 64.13 | 1924 | 79.92 | ~9200 | ~8 | 62.08 | 226 |
Kruskal–Wallis | Tree | 63.46 | 1960 | 13.14 | ~3000 | ~2 | 61.57 | 229 |
Feature-Ranking Algorithm | Number of Selected Features | Type of Model | Accuracy, % (Validation) | Accuracy, % (Test) |
---|---|---|---|---|
Chi-squared | 22 | Kernel naïve Bayes | 93.04 | 92.28 |
ANOVA | 32 | Kernel naïve Bayes | 93.47 | 91.94 |
Kruskal–Wallis | 24 | Kernel naïve Bayes | 88.61 | 91.94 |
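To illustrate what chi-squared feature ranking measures, the following sketch computes the Pearson chi-squared statistic of independence between one categorical feature and the class label, and then orders features by that statistic. It is a simplified illustration, not the procedure used in the experiments: ranking tools commonly order features by p-value, which also accounts for the degrees of freedom, whereas the raw statistic is used here for brevity. The function names and the toy data are hypothetical; only the feature names `FROM_D` and `CONT_TYPE` come from the feature table.

```python
from collections import Counter


def chi2_statistic(feature_values: list[str], labels: list[str]) -> float:
    """Pearson chi-squared statistic for a feature/label contingency table."""
    n = len(labels)
    observed = Counter(zip(feature_values, labels))
    feature_totals = Counter(feature_values)
    label_totals = Counter(labels)
    stat = 0.0
    for value, value_count in feature_totals.items():
        for label, label_count in label_totals.items():
            # Expected cell count if the feature and the label were independent.
            expected = value_count * label_count / n
            stat += (observed.get((value, label), 0) - expected) ** 2 / expected
    return stat


def rank_features(columns: dict[str, list[str]], labels: list[str]) -> list[str]:
    """Order feature names from most to least associated with the label."""
    return sorted(columns, key=lambda name: chi2_statistic(columns[name], labels), reverse=True)


# Toy data (made up): FROM_D separates the classes, CONT_TYPE does not.
labels = ["spam", "spam", "ham", "ham"]
columns = {
    "CONT_TYPE": ["text", "multipart", "text", "multipart"],
    "FROM_D": ["example.biz", "example.biz", "example.org", "example.org"],
}
ranking = rank_features(columns, labels)
```

In this toy case every expected cell count is 1, so the perfectly separating feature scores 4.0 and the uninformative one scores 0.0.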

Feature-Ranking Algorithm: Chi-Squared with 22 Selected Features; Kernel Naïve Bayes Model.

Class Label | TP (Validation) | FP (Validation) | FN (Validation) | Precision (Validation) | Recall (Validation) | F1 Score (Validation) | TP (Test) | FP (Test) | FN (Test) | Precision (Test) | Recall (Test) | F1 Score (Test) |
---|---|---|---|---|---|---|---|---|---|---|---|---|
easy-ham | 2115 | 114 | 135 | 0.9489 | 0.9400 | 0.9444 | 233 | 11 | 17 | 0.9549 | 0.9320 | 0.9433 |
easy-ham-2 | 1171 | 205 | 86 | 0.8510 | 0.9316 | 0.8895 | 131 | 30 | 9 | 0.8137 | 0.9357 | 0.8704 |
hard-ham | 174 | 5 | 50 | 0.9721 | 0.7768 | 0.8635 | 21 | 0 | 3 | 1.0000 | 0.8750 | 0.9333 |
spam | 416 | 18 | 20 | 0.9585 | 0.9541 | 0.9563 | 47 | 2 | 2 | 0.9592 | 0.9592 | 0.9592 |
spam-2 | 1116 | 31 | 82 | 0.9730 | 0.9316 | 0.9518 | 118 | 3 | 15 | 0.9752 | 0.8872 | 0.9291 |
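The per-class scores in the chi-squared table follow from the TP, FP, and FN counts through the standard definitions of precision, recall, and F1, and the overall accuracy follows from the sum of the true positives. The sketch below recomputes them from the test-set counts reported above; the function and variable names are hypothetical.

```python
def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Standard per-class metrics from true positives, false positives, false negatives."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Test-set counts (TP, FP, FN) copied from the chi-squared table above.
TEST_COUNTS = {
    "easy-ham": (233, 11, 17),
    "easy-ham-2": (131, 30, 9),
    "hard-ham": (21, 0, 3),
    "spam": (47, 2, 2),
    "spam-2": (118, 3, 15),
}

scores = {label: precision_recall_f1(*counts) for label, counts in TEST_COUNTS.items()}

# Each test message belongs to exactly one class, so TP + FN summed over
# the classes equals the number of test messages.
total_messages = sum(tp + fn for tp, _, fn in TEST_COUNTS.values())
accuracy = sum(tp for tp, _, _ in TEST_COUNTS.values()) / total_messages
```

Rounded to two decimals, the accuracy is 92.28%, and the F1 score of the spam class rounds to 0.9592, matching the reported values.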

Feature-Ranking Algorithm: ANOVA with 32 Selected Features; Kernel Naïve Bayes Model.

Class Label | TP (Validation) | FP (Validation) | FN (Validation) | Precision (Validation) | Recall (Validation) | F1 Score (Validation) | TP (Test) | FP (Test) | FN (Test) | Precision (Test) | Recall (Test) | F1 Score (Test) |
---|---|---|---|---|---|---|---|---|---|---|---|---|
easy-ham | 2117 | 78 | 133 | 0.9645 | 0.9409 | 0.9525 | 232 | 8 | 18 | 0.9667 | 0.9280 | 0.9469 |
easy-ham-2 | 1196 | 187 | 61 | 0.8648 | 0.9515 | 0.9061 | 134 | 34 | 6 | 0.7976 | 0.9571 | 0.8701 |
hard-ham | 179 | 25 | 45 | 0.8775 | 0.7991 | 0.8364 | 19 | 0 | 5 | 1.0000 | 0.7917 | 0.8837 |
spam | 412 | 17 | 24 | 0.9604 | 0.9450 | 0.9526 | 47 | 3 | 2 | 0.9400 | 0.9592 | 0.9495 |
spam-2 | 1111 | 43 | 87 | 0.9627 | 0.9274 | 0.9447 | 116 | 3 | 17 | 0.9748 | 0.8722 | 0.9206 |

Feature-Ranking Algorithm: Kruskal–Wallis with 24 Selected Features; Kernel Naïve Bayes Model.

Class Label | TP (Validation) | FP (Validation) | FN (Validation) | Precision (Validation) | Recall (Validation) | F1 Score (Validation) | TP (Test) | FP (Test) | FN (Test) | Precision (Test) | Recall (Test) | F1 Score (Test) |
---|---|---|---|---|---|---|---|---|---|---|---|---|
easy-ham | 1983 | 76 | 266 | 0.9631 | 0.8817 | 0.9206 | 231 | 6 | 20 | 0.9747 | 0.9203 | 0.9467 |
easy-ham-2 | 1183 | 263 | 75 | 0.8181 | 0.9404 | 0.8750 | 132 | 18 | 7 | 0.8800 | 0.9496 | 0.9135 |
hard-ham | 142 | 55 | 81 | 0.7208 | 0.6368 | 0.6762 | 19 | 6 | 6 | 0.7600 | 0.7600 | 0.7600 |
spam | 412 | 65 | 25 | 0.8637 | 0.9428 | 0.9015 | 45 | 5 | 3 | 0.9000 | 0.9375 | 0.9184 |
spam-2 | 1034 | 152 | 164 | 0.8718 | 0.8631 | 0.8674 | 121 | 13 | 12 | 0.9030 | 0.9098 | 0.9064 |
Research | Approach | Method | Features | Privacy | Classes | Accuracy | F1 |
---|---|---|---|---|---|---|---|
Zavrak et al., 2023 [24] | Hierarchical attentional hybrid neural networks (HANs) | FastText + HAN model architecture | Three features: links, words, and emojis/emoticons | Not preserved | Spam, ham | 72.3–91.6% | 78.9–90.3% |
Nguyen et al., 2022 [25] | Homomorphic encryption (HE) and functional encryption (FE) | A feature sparsity-based information masking method | Varying the encrypted features’ vector length n between 2000 and 5000 | Spam classification at a server without decrypting the email contents | Emails without target words; emails with target words | 75–80% (without target words); 90% (with target words) | - |
Kiamarzpour et al., 2013 [26] | Used Weka software and converted the data to the ontological format | Combining the output of several decision trees and the concept of an ontology | Waikato Environment for Knowledge Analysis (Weka) explorer | Not preserved | Spam; ham | - | 84.6–94% (spam); 85.6–94.2% (ham) |
Sathya et al., 2023 [28] | Deep learning algorithms, specifically CNN | The One-R method serves as a baseline for selecting the features | Pretrained CNN models, such as VGGNet or ResNet, were used to extract the high-level features from the visual data | Not preserved | Five categories for the detection of crime: very high, high, moderate, low, and very low | 99.01% | 98.76% |
Omotehinwa et al., 2023 [29] | Hyperparameter optimization | Random forest (RF) and extreme gradient boosting (XGBoost) ensemble algorithms | The class label of each of the emails and the text of each email | Not preserved | Spam, Ham | 97.48–98.09% | 97.58–98.16% |
Al-Hassan et al., 2023 [30] | DSpamOnto: an ontology-based spam detection model | Integrates a top-down methodology, with a mixed-based methodology | The MIB dataset is a collection of Twitter accounts | Not preserved | Social spambots, traditional spambots, fake followers | 69.99–80.29% | 59.25–75.79% |
Proposed | A domain-specific ontology for emails | Semantic parser | Dataset of email metadata | Preserved | easy-ham, easy-ham-2, hard-ham, spam, spam-2 | 92.28% | 76.00–95.92% |
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Venčkauskas, A.; Toldinas, J.; Morkevičius, N.; Sanfilippo, F. An Email Cyber Threat Intelligence Method Using Domain Ontology and Machine Learning. Electronics 2024, 13, 2716. https://doi.org/10.3390/electronics13142716