1. Introduction
A social network is a networked platform or application that allows users to create personal profiles, connect with other users and interact with them via various features, such as text posts, images, videos and comments. The aim of social networks is to establish virtual communities, making communication, information sharing and interaction easier and more accessible overall. Social networks have greatly affected our daily lives by shaping the way people communicate, stay informed, and share their experiences. However, the evolution of social media introduces critical problems, such as fake news propagation and excessive digital exposure.
Some of the most popular social networks include Facebook, Instagram, X (Twitter), TikTok, YouTube and LinkedIn. Instagram is a social network introduced by Kevin Systrom in October 2010 [1,2]. It offers its users the ability to share pictures and videos with other users or to exchange messages. In its first week after release, the application counted 100,000 new users [3], and in only 2 months it reached 1 million users [4]. In 2012, Facebook (now Meta) bought Instagram for USD 1 billion [3]. Although Instagram offers many facilities to its users, it mainly serves for sharing pictures and videos. Its users can process multimedia with various filters and organize it by location or via hashtags. Users sign up by creating a new account or profile. This is completely free but requires potential users to register some personal data, such as their name, birthdate, biography and desired username. Each account can be public or private. On a public profile, all posts, pictures and videos can be viewed by everybody, whereas on a private profile, the account owner specifies which users have access to the data they upload. Based on the data posted on Instagram, trends are formed around topics or themes that are popular at a given time. Moreover, users can “follow” other users and comment on or “like” their posts. Each Instagram user has a personal homepage, where the posts of the accounts they follow appear.
One of the Instagram features that has gained increased popularity is “Instagram Stories”, introduced in August 2016 [5]. This function allows users to create temporary content in the form of pictures or videos, which appears for 24 h and is then removed [2,5,6]. The feature has been quite popular and is a significant part of the user experience on Instagram. According to the company, over 500 million people use this service every day [5].
As is the case with every other social network, Instagram is also used for commercial purposes. It hosts accounts of companies and business stakeholders in general, allowing the promotion of goods and services, and it provides dedicated tools for commercial activities. Influencers, i.e., people with large follower bases, often cooperate with companies to promote goods and services, using audio and video to communicate commercial messages to their audience. Furthermore, politicians exploit Instagram to highlight their political positions, promote their campaigns and share snapshots of their everyday life. Instagram has thus evolved into a valuable tool for politicians, enabling direct communication with the general public and facilitating the promotion of political views.
Instagram is undoubtedly a very useful, practical, and easy tool in the hands of millions of people. However, the increase in its popularity has led to the appearance of fake accounts, a phenomenon that is unfortunately gaining momentum. Accounts that use any fake information or content that is not authentic and, directly or indirectly, aim to deceive people are considered to be fake. It is estimated that about 10% of the 2.4 billion accounts on Instagram are fake (2024). Fake accounts may have the following aims:
Malicious activity: Some users create fake accounts to carry out malicious activities such as fraud, phishing or the blackmailing of other users.
Spamming: These accounts are used to persuade users to buy a specific good/service that appears continuously in posts.
Mass influence: These accounts influence trends by publishing numerous messages of similar content with different phrasing so as not to be recognized.
Increased followers: In certain situations, fake accounts are produced to increase the number of followers of another user, usually with the aim of inflating their popularity.
Selling followers or likes: There is also the case of selling or buying fake followers or likes, and fake accounts are used to offer this service.
The last two goals are very frequent, as they serve as means to establish influencers, whose popularity relies largely on the number of followers they have. The higher the number of followers, the more money they are paid to promote products or services. Meanwhile, as technology has dramatically evolved, it has become possible to generate content automatically and to exploit social media algorithms, inflating metrics so that content is promoted to even more users. As far as the mass influence of social media is concerned, it is important that safeguards are in place to protect the public and prevent the formation of political or social views based on fake information. Instagram takes several measures to detect fake accounts and encourages users to report suspicious activity. Nevertheless, the percentage of fake Instagram accounts is still very high, and more effective means need to be employed. This paper aims to present such means, which facilitate the detection of fake Instagram accounts.
To build models supporting the above and then detect fake Instagram accounts on the fly, sufficient publicly available social media data need to be identified and processed. However, there are various limitations regarding the usage of public social media data [7]. From a technical perspective, the identified data are almost never ready for use and require heavy processing before becoming useful. There are major GDPR [8] and privacy implications in using such public data, while public self-reported data are often biased, erroneous or incomplete, leading to unreliable content. Data acquisition limitations are also a problem: depending on the platforms used, data services cannot always aggregate data owned by third parties, such as social media platforms. In this respect, third-party data collection is an important concern for scientific reproducibility, as the raw queries generated from query logs are often proprietary and cannot be shared, or may be available for study for a certain time but not later. Moreover, a major drawback is the lack of control over social media data and the fact that user behavior frequently changes across social media, which negatively affects the process of data interpretation. Furthermore, there are major ethical concerns in using publicly available data. According to IRB guidelines, this work does not require informed consent to access and process publicly available data. The owners of public social media data may therefore not be aware that their data are public or are being obtained and used by external actors. This is a significant ethical problem, resulting from the frequently unclear privacy management systems and policies in place, while the boundaries between public and private data, and the extent to which public data can be used for research purposes, are often unclear in the context of social media.
The work presented in this paper demonstrates a significant advancement in detecting fake Instagram accounts using a relatively small dataset. We used several repositories, each offering limited data; we therefore sought to identify as many datasets as possible, regardless of their small size, and to combine them into a larger dataset by identifying and matching common features. This approach was not straightforward, since numerous modifications were required to construct a reliable dataset that could be used for the experiments eventually conducted. Despite the limited data, this study achieved higher accuracy scores compared to other studies that utilized larger datasets. This success was accomplished through the application of various machine learning (ML) models, including Gaussian Naïve Bayes, Random Forest, Decision Trees, Logistic Regression, MLP, KNN and SVM. The proposed approach involves feature engineering techniques, which enhance the performance of the ML models. As elaborated upon in the rest of this paper, the proposed method outperforms other studies in the field, indicating that effective fake account detection can be achieved with smaller datasets when advanced ML techniques are applied.
2. Related State-of-the-Art Work
This section aims to review the related state-of-the-art work regarding the detection of fake social media accounts. The first subsection focuses exclusively on Instagram, whereas the second one addresses other social networks, such as Facebook, X (Twitter), etc.
2.1. Instagram
In [9], the researchers compared the performance, and specifically the F1-score, of several machine learning algorithms, such as Naïve Bayes, Logistic Regression and SVM, as well as a neural network of their own, in identifying fake accounts and automated bot accounts. The data used were imbalanced: 20% were fake accounts, while 80% were real accounts. The researchers therefore produced synthetic accounts using the SMOTE-NC algorithm. They concluded that the SVM algorithm and their neural network performed best, achieving 94%. They note that oversampling improved their results; without it, the SVM algorithm alone achieved the best F1-score, at 89%.
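As an aside, the interpolation idea at the heart of SMOTE-style oversampling can be sketched in a few lines. This is a generic NumPy illustration, not the code of [9]: the real SMOTE-NC algorithm additionally handles categorical features, which a plain interpolation scheme cannot.

```python
import numpy as np

def smote_like_oversample(X_min, n_new, k=5, rng=None):
    """Generate n_new synthetic minority samples by interpolating between
    each chosen sample and one of its k nearest minority neighbours (the
    numeric-feature core of SMOTE; SMOTE-NC also handles categoricals)."""
    rng = np.random.default_rng(rng)
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        # distances from sample i to every minority sample
        d = np.linalg.norm(X_min - X_min[i], axis=1)
        neighbours = np.argsort(d)[1:k + 1]   # skip the sample itself
        j = rng.choice(neighbours)
        gap = rng.random()                    # interpolation factor in [0, 1)
        synthetic.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.array(synthetic)

# toy minority class (e.g. the 20% fake accounts), two numeric features
X_min = np.array([[1.0, 2.0], [1.2, 1.9], [0.9, 2.2],
                  [1.1, 2.1], [1.0, 1.8], [1.3, 2.0]])
X_new = smote_like_oversample(X_min, n_new=18, rng=0)
```

Each synthetic point lies on the segment between two real minority samples, so the new data stay inside the minority class's feature range.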
In [10], the researchers proposed a machine learning model based on the Bagging algorithm. They built a web crawler for Instagram and, in combination with the Instagram API, collected information and classified user accounts. They compared their model's performance with popular classifiers, such as Random Tree, J48, SVM, RBF, MLP, Hoeffding Tree and Naïve Bayes, concluding that their model was the most efficient, achieving an accuracy of 98.45% with a very small percentage of misclassifications. Finally, they classified users based on each feature separately in order to identify the effect of each one on the final result.
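For readers unfamiliar with Bagging, the technique trains many base classifiers on bootstrap resamples of the data and aggregates their votes. The sketch below is a generic scikit-learn illustration on synthetic data, not the crawler-based system of [10]; the features and labels are invented stand-ins.

```python
import numpy as np
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
# synthetic stand-in features (e.g. followers, following, posts) and labels
X = rng.normal(size=(400, 3))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)  # toy labeling rule

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Bagging: 50 decision trees (the default base estimator), each trained
# on a bootstrap sample; predictions are aggregated by majority vote
clf = BaggingClassifier(n_estimators=50, random_state=0)
clf.fit(X_tr, y_tr)
acc = clf.score(X_te, y_te)
```

Bagging mainly reduces variance, which is why high-variance learners such as decision trees are the usual base estimator.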
In [11], the authors conducted extended research aiming to answer questions concerning not only the optimal way of separating accounts into fake and real, but also the features of the accounts and the account type (for instance, metadata, media info, etc.). They collected data under very strict criteria, gathering fake accounts from companies that sell such accounts and real accounts via web scraping, analyzing 24 Instagram accounts of private universities. The final labeling of the accounts—32,640 in total—was performed by a three-member committee. They executed two experiments using both the typical two-class classification (fake–authentic) and a four-class classification (authentic, active fake user, inactive fake user, spammer) with the following algorithms: Random Forest, MLP, Logistic Regression, Naïve Bayes, J48 and Decision Tree. They found that, for both classification types, Random Forest performed best, at 90.09% and 91.76%, respectively. They also concluded that metadata is the account feature type that plays the most significant role in this kind of classification.
In [12], the authors tried to identify both fake and spam accounts. To identify spam accounts and estimate their effect, they proposed two algorithms, “InstaFake” and “InstaReach”, whose selected features were drawn from the “Instagram fake spammer genuine accounts” dataset. For the selection of the specific features, they conducted in-depth research and monitored their relationships. Similarly, they used all the available account features to feed a Deep Learning Neural Network they had constructed to identify fake accounts. The results showed an accuracy of around 91%.
In [13], the researchers carried out a literature survey, depicting in detail what had been achieved regarding fake account identification on Instagram with the help of machine learning up to the publication date. They presented the various methods they examined along with their results, without expressing a preference for any of the existing methods or proposing a new one.
The authors in [14] proposed a Deep Learning model for classifying accounts into real and fake ones. They used all the data of the “Instagram fake spammer genuine accounts” dataset and modeled a four-hidden-layer ANN with 34,752 trainable parameters. Each hidden layer was followed by a dropout layer with a rate of 0.3, and the model was trained for 20 epochs. The accuracy of the model reached 93.63%, with a final loss of 0.18.
The authors in [15] used Random Forest and Logistic Regression to address the fake account identification problem. They used the “Instagram fake spammer genuine accounts” dataset and reported the following metrics: accuracy, recall, precision and F1-score. The authors reported about 92.5% accuracy using Random Forest and 90.8% accuracy using Logistic Regression, without expressing a preference for either of the two algorithms.
In another study [16], the authors noticed that most of the implemented approaches relied on the Random Forest algorithm. They proposed a system exploiting gradient boosting, an algorithm related to Random Forest which, according to the authors, offers the key advantage of handling missing input values. They did not feed the metadata gathered via web scraping directly into the algorithm, but rather quantities derived from it: engagement rate, artificial activity and spam commenting. The authors did not report the performance of their approach in raw numbers, but implied that it outperforms existing propositions in the literature.
2.2. Other Social Networks
The authors of [17] proposed a machine learning approach consisting of two fully developed models: one targeting identity verification and account security, the other targeting text analysis and harmfulness of users' posts and comments. The first uses users' IP and MAC addresses to check for the possible existence of multiple suspicious accounts and requests authentication if needed, whereas the second uses SVM classifiers to count the frequency of “malicious/toxic” words in the text and Natural Language Processing (NLP) to handle foreign languages and data. The authors evaluated their approach across various social networks (Facebook, Instagram, Twitter, YouTube, WhatsApp) and report high accuracy in fake account identification.
The authors of [18] proposed a classification algorithm named SVM-NN, which combines the well-known SVM algorithm with a neural network designed to better classify a social network's users into real and fake ones. More specifically, the training outputs of the SVM were used to train the neural network, and the testing outputs were used for evaluation. The “MIB” dataset of Twitter accounts was used and, during processing, was divided into four non-overlapping subsets fed to the three evaluated systems: the SVM, the neural network and their combination, SVM-NN. The authors observed that their SVM-NN model showed higher performance and better behavior with respect to outliers, with the classification accuracy of the stored accounts approaching 98%. They also noticed that the various subsets of the initial data did not behave identically regarding the effectiveness of the final classification, a result of the feature correlations each subset contained, as well as of the nature of the algorithm chosen in each split.
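The core SVM-NN idea of feeding SVM outputs into a neural network can be approximated with scikit-learn. This is a loose sketch on synthetic data (the exact pipeline of [18] may differ); here the SVM's continuous decision score is appended to the features before training the network.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(2)
X = rng.normal(size=(400, 4))
y = (X[:, 0] - X[:, 1] > 0).astype(int)   # toy labeling rule
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# stage 1: fit an SVM and keep its continuous decision scores
svm = SVC(kernel="rbf").fit(X_tr, y_tr)
f_tr = np.column_stack([X_tr, svm.decision_function(X_tr)])
f_te = np.column_stack([X_te, svm.decision_function(X_te)])

# stage 2: a small neural network learns from features + SVM score
nn = MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000,
                   random_state=0).fit(f_tr, y_tr)
acc = nn.score(f_te, y_te)
```

The hybrid lets the network correct cases near the SVM's decision boundary while reusing the margin information the SVM has already learned.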
The authors of [19] carried out extensive bibliographic research in November 2020, presenting in detail initiatives focused on fake account detection across all social networks. Given the large number of initiatives, they made a targeted selection based on criteria intended to guarantee the correctness and validity of the research, such as the reputation of the publisher, the documentation of the publication, comparison with previous works and the usage of machine learning models. They then collected all the features used in previous studies and classified the publications based on these characteristics. They depicted the various methods examined, the related means and their results without showing preference for any of the proposed methods, and did not propose any new method.
In another research initiative [20], the authors compared the performance of three machine learning algorithms in classifying Twitter users into bots and real users. It is worth noting that, based on the dataset used, the word “bot” can mean “fake profile”, which is the target of the current article. The following three algorithms were used: Logistic Regression, SVM and Random Forest. The dataset was a subset of the one used in [21]. It was preprocessed before being fed to the algorithms, and 24 features were kept, separated into classes. The Logistic Regression and Random Forest models showed the highest performance, with the latter excelling and achieving an accuracy of 97.9%. The authors noted that although the dataset and the algorithms could easily identify automated bot accounts, identifying human-operated fake accounts proved markedly difficult.
In another research initiative [22], the researchers compared supervised models classifying fake accounts based on users' emotions on Facebook. They implemented an extensive analysis of the emotions expressed by users in their posts, with very interesting results regarding user behavior on social networks. Twelve features were used to indicate emotions such as anger, regret, happiness, fear and disgust. The models used were SVM, Naïve Bayes, JRip and Random Forest, evaluated with the metrics accuracy, F-measure and AUROC. Random Forest showed the best performance across all three metrics and was the one chosen by the researchers.
The authors in [23] produced a model for identifying fake accounts on LinkedIn. They claim that, in order to separate fake from real profiles on social networks, site features based on users' accounts need to be identified. However, the access restrictions around private information made their work more difficult. For this reason, they experimented with a small dataset of 74 samples in total; specifically, 40 samples were real and 32 were fake. The data were split into training and testing sets at a 1:1 ratio. Features were extracted via the PCA method and then fed to NN, SVM and Weighted Average (WA) classifiers to classify real and fake profiles. Based on their results, the researchers proposed the SVM model, which reached an accuracy of 87.34%.
The authors of [24] proposed a model that uses a pattern-matching algorithm to identify fake accounts on Twitter. They gathered 62 million user profiles using crawlers, which were then processed to be easily manageable. During this process, 724,494 sets were produced and 6,958,523 accounts were retained in total. To refine the 724,494 sets, screen names were analyzed and identical ones were identified. Finally, the update time of an account with new activity (such as a new post), the creation time of the accounts and the URL analysis of each account were very important in producing behavioral prototypes of target users. The researchers completed the process of grouping the accounts with this method, based on features and sets with similar behavior. In conclusion, beyond the analysis performed, a very reliable subset of fake accounts was specified using the map-reduce techniques and pattern recognition presented.
The researchers of [25] tried to identify fake accounts in real time via an extension/add-on to the Google Chrome browser. They used machine learning techniques to achieve their goal and collected data by combining crawlers and Twitter APIs, as well as their own research. The proposed idea was based on the Random Forest and Bagging models, which demonstrated the best performance among the five algorithms evaluated in terms of the ROC and TP (True Positive) metrics, reaching a score of 99.4%.
3. Datasets Used in the Conducted Experiments
In this section, the datasets used for extracting the research results of the current work are presented. The final dataset combines two separate datasets with similar features and attributes describing account properties on Instagram.
3.1. The Dataset “InstaFake”
The “InstaFake Dataset” [26] was collected by a specific research initiative [9], and the respective contributors have made it available to any researcher via a public repository. It carries information about user accounts, which are classified according to the authors’ criteria into the following four categories:
Fake—malicious;
Real;
Automated;
Non-automated.
For the problem addressed in the study presented in this paper, the accounts of the first category are of particular interest. As downloaded, the data are divided into two JSON files, each containing the accounts that belong to one of the two relevant categories.
Table 1 presents some quantitative elements for these two files.
Table 2 below depicts in detail the features (columns) of the specific accounts along with a short description.
3.2. Dataset “Instagram Fake Spammer Genuine Accounts”
The “Instagram fake spammer genuine accounts” dataset [27] was collected from a public repository. With a format and features similar to those previously analyzed, it seemed the ideal extension of the “InstaFake Dataset”. It contains information about accounts classified into the following two classes:
Fake—spammer;
Genuine (real).
The data, as downloaded, are divided into two .csv files, “train.csv” and “test.csv”, which contain the training data and testing data, respectively. Each file contains accounts of both categories. Naturally, the ratios of fake (malicious) to real accounts differ depending on the purpose for which they are used. Table 3, Table 4 and Table 5 below present some quantitative elements of these two files.
3.3. Final Dataset
To produce the final dataset used in the study presented in this paper, there was a need to modify the available datasets. These modifications aimed to achieve the optimum utilization of the common information.
Table 6 presents the quantitative elements for the final dataset.
The modifications that took place in the two initial datasets were the following:
The columns “nums/length username” and “username_digit_count” were used to generate the new column “hasDigitsInUsername”. This was carried out because the information they provided was interesting but not directly compatible across the two datasets.
The columns “username_length” and “username_digit_count” of the “InstaFake Dataset” were used to generate the column “nums/length username”, as follows:
nums/length username = username_digit_count / username_length
The columns “fullname words”, “nums/length fullname”, “name == username” and “external URL” of the “Instagram fake spammer genuine accounts” dataset were deleted, since there was no way to map their content to any of the columns of the “InstaFake Dataset”.
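The first two modifications above can be sketched with pandas. The rows below are toy values; only the column names come from the paper, and the derivation of “hasDigitsInUsername” from a positive digit count is one plausible reading of the described procedure.

```python
import pandas as pd

# toy rows shaped like the "InstaFake Dataset" (column names from the paper)
instafake = pd.DataFrame({
    "username_length": [8, 12, 6],
    "username_digit_count": [0, 4, 2],
})

# derive the shared features used in the merged dataset
instafake["hasDigitsInUsername"] = (instafake["username_digit_count"] > 0).astype(int)
instafake["nums/length username"] = (
    instafake["username_digit_count"] / instafake["username_length"]
)
```

Deriving both columns in each source dataset gives the two repositories a common schema so their rows can simply be concatenated.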
In Table 7, there is a detailed presentation of the final features of the specific accounts along with a synoptic description.
4. Fake Account Detection on Instagram
This section presents the proposed approach for detecting fake accounts on Instagram. Section 4.1 presents the preprocessing of the available dataset used to feed the various machine learning models. Section 4.2 discusses feature correlation via a correlation matrix. Section 4.3 analyzes the implementation of the algorithms by the authors of this paper. Finally, Section 4.4 elaborates on the performance of each algorithm tested.
4.1. Data Preprocessing
To achieve the highest possible success rate for the fake account detection methods used, a technique known as feature engineering was applied, which uses existing features to produce new ones. Although adding derived features might seem to contribute nothing new to the training of a machine learning model, it has been observed that a suitable feature combination can significantly improve the final results of a study. Consequently, the study presented in this paper produced three new features. Their selection was based on the current literature, and a significant increase in the success of the models used was observed.
4.1.1. Following/Followers Ratio
The following/followers ratio is a feature referring to the relationship between the number of accounts a user follows and the number of accounts following them, which can provide some indication of the authenticity of an account on a social network, although it is of course not an absolute indicator. The ratio is given by the following formula:
following/followers ratio = numberOfFollowing / numberOfFollowers
From a mathematical perspective, there is a constraint in the case where the number of followers equals zero. Based on the statistical picture in Figure 11, it is obvious that real users have a very low following/followers ratio compared to fake ones. The best approach in this case was to set the following/followers ratio of such accounts to the maximum ratio appearing in the rest of the data (presented in the next sections), which equaled 2000. Consequently, this choice cannot affect the final result in any way, since only fake accounts exhibit such behavior in the available dataset, which facilitates this mathematical modeling.
As Figure 11 shows, fake accounts are largely produced with the goal of increasing the following of third-party accounts in order to obtain benefits, for the various reasons analyzed in Section 1. This result was expected, given that there are companies which produce and sell such accounts in bulk. Obviously, the production rate of such accounts makes account management difficult; this benefits the real accounts that “need” such a service in order to increase their number of followers.
Of course, the following count alone could hint at falsity, but a user's profile cannot be removed simply because they choose to follow many third-party accounts on Instagram.
Related literature indicates that a legitimate account typically follows around 150 other users. In the studied dataset, approximately 63% of real accounts do not meet this criterion and would thus have been falsely flagged as suspicious. To avoid this, further analysis was carried out, leading to the introduction of the three new features, which allow the inclusion of real user accounts that follow fewer than 150 other users. Details are provided in the remainder of this section. Moreover, it is often the case that some users create accounts exclusively to follow events for which social media accounts are created, e.g., online competitions, gatherings, various meetups, etc. At the same time, they follow others and make friends without being active (i.e., posting, commenting, liking, etc.) every day. Such actions could still attract followers.
As a result, considering the previous parameter, a small number of followers alone cannot indicate whether a given profile was created with legitimate intent, since the kinds of users previously described would otherwise be automatically blocked. Viewed arithmetically, around 10% of the real users in our sample would be blocked because of having a low number of followers.
Based on the following/followers ratio, one can more confidently tell the difference between real and fake accounts because of their distribution, since fake accounts disproportionately follow many more users than the number of followers they have.
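The zero-follower special case described above can be encoded in a couple of pandas lines. This is a sketch with hypothetical column names and toy values; the cap of 2000 is the value described earlier for accounts with zero followers.

```python
import pandas as pd

# toy accounts (hypothetical column names)
df = pd.DataFrame({
    "numberOfFollowing": [1500, 150, 75, 900],
    "numberOfFollowers": [0, 300, 60, 3],
})

CAP = 2000  # ratio assigned when followers == 0, per the rule above
df["following/followers ratio"] = (
    df["numberOfFollowing"] / df["numberOfFollowers"]
).where(df["numberOfFollowers"] > 0, CAP)
```

Division by zero yields `inf` in pandas, and `.where` then substitutes the cap only for those rows, leaving well-defined ratios untouched.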
4.1.2. Following/Posts Ratio
This is another criterion that can provide some indication of the authenticity of an account on a social network. It refers to the relation between the number of users someone follows and the number of posts on their profile, as given by the following formula:
following/posts ratio = numberOfFollowing / numberOfPosts
Also in this case, a mathematical inconsistency arises, since many accounts have no posts.
Figure 12 shows that fake accounts tend to have higher values for this specific feature compared to real accounts. Contrary to the previous case, a unified approach could not be implemented, since there were 48 real accounts which, if assigned a high score, were likely to be wrongly classified. For this reason, it was decided that each account falling into this category should be assigned the mean value corresponding to its account type. As a result, the fake accounts take the value 524.399, whereas the real ones take the value 41.542, preventing any inconsistency and thus providing a satisfactory solution to the problem.
The analysis for this feature is similar to that of the following/followers ratio, and the same observations made in the “Following/Followers Ratio” section apply here as well. Different styles of and perspectives on using social media can lead to accounts with authentic content being misclassified if the combined features are considered in isolation. Taken alone, this criterion would suggest that 25% of real accounts should be deactivated because of having a small number of posts (<10).
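The per-class mean substitution for zero-post accounts can be sketched as follows. Column names and values are hypothetical toy stand-ins; the point is that accounts with no posts receive the mean ratio of their own class rather than an arbitrary constant.

```python
import numpy as np
import pandas as pd

# toy accounts (hypothetical column names)
df = pd.DataFrame({
    "numberOfFollowing": [800, 120, 500, 90],
    "numberOfPosts":     [0,   40,  10,  0],
    "isFake":            [1,   0,   1,   0],
})

# turn zero post counts into NaN so the division stays undefined there
ratio = df["numberOfFollowing"] / df["numberOfPosts"].replace(0, np.nan)

# zero-post accounts get the mean ratio of their own class (fake or real)
class_mean = ratio.groupby(df["isFake"]).transform("mean")
df["following/posts ratio"] = ratio.fillna(class_mean)
```

Using class-conditional means preserves the separation between the two distributions instead of pulling all undefined rows toward a single global value.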
4.1.3. Followers/Posts Ratio
This refers to the relation between a user's number of followers on a social network and the number of posts on their profile. This feature can be used to understand how a user manages their activity on the social network and to derive indications of their authenticity. It is defined by the following equation:
followers/posts ratio = numberOfFollowers / numberOfPosts
In this case, the same mathematical issue as in Section 4.1.2 can arise for accounts with no posts, resulting in division by 0. Figure 13, in contrast to the previous two cases, shows that fake accounts tend to have lower values for this feature compared to real ones. As in the previous case, a unified policy could not be applied, since there were 294 fake accounts which, if attributed the highest possible value found among real accounts, would likely be wrongly classified. For this reason, we decided to employ the same policy as in the previous subsection for each account showing this problem, defining as the feature value the average observed for the corresponding kind of account. As a consequence, the fake accounts were assigned a value of 48.98, while the real accounts were assigned a value of 179.90. As previously observed, the features could individually provide a view of account authenticity, but the results would be too precarious. Combining them delivers better separation and, at the same time, helps the classifiers used by offering extra features that facilitate correct classification.
The results depicted in Figure 13 may seem strange at first; however, the fake accounts have very few posts and very small numbers of followers. Although the average value seems high, it should be highlighted that a standard practice of people who produce such accounts is to follow the fake accounts they create, in an effort to produce an appearance of authenticity without real content or activity. This does not compare to what real accounts show: there, it is far more common to have many more followers than posts, especially in the case of popular people. According to research conducted in 2022, an average user posts infrequently (0–3 times per week), particularly on Instagram, where stories are dominant; stories, of course, are not counted as posts.
4.2. Correlation Matrix
As was mentioned at the beginning of this paper, our target was to use as few input features as possible while still obtaining the best results. Following the preprocessing, the correlation matrix of all the features used can be seen below in
Figure 14.
It is obvious that, excluding the features "numberOfFollowers" and "followers/posts ratio", all the other features take part in identifying account authenticity, showing a good (positive or negative) correlation with the "IsFake" label. This procedure aims to achieve the highest possible success rates. Although their correlation with the label is weak, it was decided to keep "numberOfFollowers" and "followers/posts ratio", since doing so allowed one more fake account to be detected. As will be presented in a following section, this helps extract better results and increases the metrics for most of the algorithms analyzed.
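The correlation check described here is commonly carried out with pandas; the following is a minimal sketch on synthetic data, with feature names assumed for illustration rather than taken from the paper's exact schema.

```python
# Illustrative check of each feature's correlation with the "IsFake" label,
# mirroring the correlation-matrix step. Data and feature names are synthetic.
import pandas as pd

df = pd.DataFrame({
    "numberOfFollowers": [900, 12, 4, 1500, 7],
    "numberOfPosts":     [120, 1, 0, 300, 2],
    "numberOfFollowing": [300, 2000, 1800, 250, 2400],
    "IsFake":            [0, 1, 1, 0, 1],
})

corr = df.corr(numeric_only=True)
# Rank features by the absolute strength of their relation to the label;
# weakly related ones are candidates for exclusion (or for deliberate
# retention, as done in the paper).
print(corr["IsFake"].drop("IsFake").abs().sort_values(ascending=False))
```

Both strongly positive and strongly negative values are informative here, which is why the absolute value is used for ranking.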
4.3. Realization of Algorithmic Experimental Evaluation in Four Phases
This study’s final results derive from the comparison of machine learning algorithms, which are used to classify accounts into categories. The algorithms used were the following: Naïve Bayes (Gaussian), Random Forest, Decision Trees, Logistic Regression, MLP, KNN and SVM. These algorithms were used in four phases, as shown in
Figure 15. These phases were separated based on the structure of our dataset, in a way that will be demonstrated in a following section. In each phase, the seven algorithms were fed the same data and produced results that were evaluated using the following metrics: accuracy, precision, recall, F1-score, AUC-ROC and log-loss. The nature and specificity of each algorithm affect the results as the phases progress.
Each phase includes a brief account of the differences in the algorithms' inputs, as well as the results, presented both in tables and in diagrams. As will be shown, the SVM algorithm performs markedly worse than the rest. For this reason, although it is presented in each results table, it is omitted from the corresponding diagram: its low values would limit the reader's ability to visually compare the implementation results, which is very important in such a presentation. Similarly, the log-loss metric behaves irregularly, taking values exceeding one, which makes a conventional presentation of the results very difficult. The metric's behavior in each phase is therefore presented in a separate diagram.
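The per-phase evaluation loop described above could be sketched as follows with scikit-learn. The dataset here is synthetic and the hyperparameters are illustrative defaults, not the paper's configuration.

```python
# Sketch of one evaluation phase: the same data is fed to all seven
# classifiers and scored with the six metrics named in the text.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score, log_loss)

# Synthetic, imbalanced stand-in for the Instagram dataset.
X, y = make_classification(n_samples=400, n_features=8, weights=[0.8, 0.2],
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

models = {
    "GaussianNB": GaussianNB(),
    "RandomForest": RandomForestClassifier(random_state=0),
    "DecisionTree": DecisionTreeClassifier(random_state=0),
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "MLP": MLPClassifier(max_iter=1000, random_state=0),
    "KNN": KNeighborsClassifier(),
    "SVM": SVC(probability=True, random_state=0),  # probas needed for AUC/log-loss
}

results = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    pred = model.predict(X_te)
    proba = model.predict_proba(X_te)[:, 1]
    results[name] = {
        "accuracy": accuracy_score(y_te, pred),
        "precision": precision_score(y_te, pred, zero_division=0),
        "recall": recall_score(y_te, pred, zero_division=0),
        "f1": f1_score(y_te, pred, zero_division=0),
        "auc_roc": roc_auc_score(y_te, proba),
        "log_loss": log_loss(y_te, proba),
    }

for name, m in results.items():
    print(f"{name}: accuracy={m['accuracy']:.3f}, auc={m['auc_roc']:.3f}")
```

Each subsequent phase would re-run this loop with the corresponding ratio feature appended to `X`.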
4.3.1. First Phase—Without Extra Features
In the first phase, the extracted results (Table 8) were obtained by feeding the classifiers exclusively the initial dataset, without adding any of the extra features referenced in Section 4.1. Figure 16 shows the aggregated results of the first-phase evaluation, and Figure 17 shows the log-loss evaluation of the first phase.
4.3.2. Second Phase—Adding Following/Followers Ratio
In the second phase, the results, presented in Table 9, Figure 18 and Figure 19, were extracted by adding the "following/followers ratio" feature to the initial dataset.
4.3.3. Third Phase—Adding Following/Posts Ratio
In the third phase, the extracted results (Table 10, Figure 20 and Figure 21) were derived by adding the "following/posts ratio" feature to the classifier input, on top of the dataset of the previous phase.
4.3.4. Fourth Phase—Adding Followers/Posts Ratio
In the fourth phase, the extracted results (Table 11, Figure 22 and Figure 23) were derived by adding the "followers/posts ratio" feature to the classifier input, on top of the dataset of the previous phase.
4.4. Evaluation Results for Each Algorithm
4.4.1. Gaussian Naïve Bayes
Concerning the results of the Gaussian Naïve Bayes algorithm, as depicted in
Figure 24 and
Figure 25, the following can easily be observed:
In the first phase, the recall is very high, meaning that the model correctly identifies most fake accounts, but the low precision shows that many real accounts are evaluated as fake ones.
In the second phase, it can be observed that there is a sudden increase in the metrics’ values. The log-loss has significantly decreased, suggesting better probabilistic evaluation. The addition of the feature “following/followers” seems to seriously affect the performance of this algorithm, validating our decision to include it in our research.
In the third and fourth phases, the model shows high performance across the board, apart from a slight decrease in its metrics in the last phase.
Overall, the Gaussian Naïve Bayes algorithm performs better across various metrics after the addition of the features, making it more effective in identifying fake accounts.
4.4.2. Random Forest
As far as the Random Forest results are concerned, as depicted in
Figure 26 and
Figure 27, the results are the following:
The algorithm demonstrates stable and significant improvement in its performance during each phase, utilizing all extra features that were offered in order to improve its internal operations.
The algorithm demonstrates high accuracy, recall and F1-score, showing good classification capability as well as effective fake account identification.
The decrease in log-loss shows that the model improves its probabilistic evaluation and is optimal in producing probabilistic forecasting.
The AUC-ROC keeps high values (from 96.50% to 96.80%), thus showing a very good capability of the model to understand the classes.
Overall, it demonstrates significant total performance with stable improvement in various metrics. Its capability to provide high accuracy, stabilized precision and recall, as well as low log-loss makes it a powerful algorithmic model for the specific problem studied herein.
It is the best choice among the algorithms used for fake account identification.
4.4.3. Decision Trees
As far as the results of the Decision Trees algorithm are concerned, as depicted in
Figure 28 and
Figure 29, the following conclusions can be drawn:
The Decision Trees algorithm offers high performance in all metrics, with the capability to provide reliable forecasting and effectively separate the classes.
The stable improvement of its metrics over the consecutive phases shows that the model makes good use of the additional features offered: it improves its internal procedures and produces better predictions.
Overall, the Decision Tree seems to be an effective choice for the specific problem at hand, offering high accuracy and satisfactory balance between precision and recall.
4.4.4. Logistic Regression
Concerning the results of the Logistic Regression model, as depicted in
Figure 30 and
Figure 31, the following can be stated:
The model shows a stable increase in accuracy, ranging from 95.41% in the first phase to 97.00% in the fourth phase, offering a reliable and improved total classification.
It shows balanced performance between precision and recall (F1-score), demonstrating the model's capability to find fake accounts while decreasing incorrect predictions. The extra features added play a vital role in increasing the F1-score.
4.4.5. MLP
As far as the MLP results are concerned, as depicted in
Figure 32 and
Figure 33, the following conclusions can be drawn:
There is a stable improvement over the phases, with a notable contribution from the feature "followers/posts ratio", even though Figure 14 showed it as having low correlation with account authenticity. In terms of the metrics' values, the extracted results indicate that this model is the second best, after the Random Forest model.
The model shows exceptional performance with high accuracy, precision, recall, F1-Score and AUC-ROC.
The log-loss decreases significantly (from 1.59 to 0.95), indicating improved probabilistic evaluation.
4.4.6. ΚΝΝ
As far as the KNN algorithm results are concerned, as depicted in
Figure 34 and
Figure 35, the following should be noted:
There is a stable increase in the metrics' values over the four phases, with the largest gain in the third phase, which means that the "following/posts ratio" feature drastically changes the distance computations on which the algorithm relies.
The log-loss takes much higher values than in the previous implementations, reflecting the fact that the KNN model requires a lot of data (entries, not features) in order to produce better results.
4.4.7. SVM
As far as the SVM results are concerned, as depicted in
Figure 36 and
Figure 37, the following should be noted:
In the first phase, the accuracy is 71.96%, a moderate value, but the low recall shows that there are many false negatives. This is a problem, given that the task is the classification of fake and real accounts.
In the next three phases, the values are largely stable, showing that the additional features did not affect the model's performance from the third phase onwards. The increased accuracy and slightly increased recall, in comparison to the first phase, are positive signs, indicating that the addition of the "following/followers" feature helped considerably.
The AUC-ROC values are around 0.53 in all phases, implying that the model's capability to separate the classes is only marginally better than random. A similar problem appears with the log-loss, which is very high (close to 10) in all cases, possibly implying a problem in probability estimation.
Overall, the SVM results are not satisfactory. However, the imbalance of the dataset, as far as the volume of each class is concerned, is such that the metrics fall to considerably low levels.
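One common mitigation for such class imbalance, sketched below without any claim that it was applied in this paper, is to weight the SVM's classes inversely to their frequency. The data here is synthetic.

```python
# Sketch: class weighting as a possible remedy for an imbalanced dataset.
# This is illustrative, not the configuration used in the study.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import recall_score
from sklearn.svm import SVC

# Imbalanced synthetic data: ~90% majority class, ~10% minority class.
X, y = make_classification(n_samples=600, weights=[0.9, 0.1], random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=1)

plain = SVC(random_state=1).fit(X_tr, y_tr)
# class_weight="balanced" scales C inversely to each class's frequency.
weighted = SVC(class_weight="balanced", random_state=1).fit(X_tr, y_tr)

recall_plain = recall_score(y_te, plain.predict(X_te))
recall_weighted = recall_score(y_te, weighted.predict(X_te))
print("minority-class recall (plain):   ", recall_plain)
print("minority-class recall (balanced):", recall_weighted)
```

Whether weighting actually helps depends on the data; it is one of several knobs (alongside resampling) worth trying when recall on the minority class is the bottleneck.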
4.4.8. Generalizability of the Findings
As far as the generalizability of the findings is concerned, SVMs are generally used when datasets are linearly separable, but they can be configured for non-linear problems via the use of kernel functions.
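The kernel remark can be illustrated on a classic toy dataset that is not linearly separable; this generic scikit-learn sketch is not drawn from the paper's experiments.

```python
# A linear SVM cannot separate concentric circles, while an RBF kernel can.
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Two concentric rings of points: inherently non-linearly separable.
X, y = make_circles(n_samples=300, factor=0.3, noise=0.05, random_state=0)

linear = SVC(kernel="linear").fit(X, y)
rbf = SVC(kernel="rbf").fit(X, y)

print("linear kernel accuracy:", linear.score(X, y))
print("rbf kernel accuracy:   ", rbf.score(X, y))
```

The RBF kernel implicitly maps the points into a space where a separating hyperplane exists, which is why it recovers the circular boundary that defeats the linear kernel.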
In Decision Trees, the interpretation of the values depends on the methodology selected, but in general, high values indicate the serious engagement of a feature in the model's prediction capability.
In MLPs, in general, the first layer has the same number of neurons as the number of features of the given inputs. The in-between layers can contain various numbers of neurons in relation to the complexity of the problem and the requirements of the model. The last layer has one neuron for each class or prediction it is trying to achieve.
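The layer-sizing convention described above can be seen directly in scikit-learn's MLPClassifier, where the input layer is inferred from the data; the hidden layer sizes below are arbitrary examples, not the architecture used in this study.

```python
# Sketch of the MLP layer-sizing convention: input layer sized by the number
# of features, hidden layers chosen by the modeller, output layer one unit
# per class (a single unit for binary problems in scikit-learn).
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=200, n_features=8, random_state=0)

mlp = MLPClassifier(hidden_layer_sizes=(16, 8), max_iter=2000, random_state=0)
mlp.fit(X, y)

# coefs_[0] connects the input layer to the first hidden layer, so its first
# dimension equals the number of input features.
print("input units: ", mlp.coefs_[0].shape[0])   # 8, matching n_features
print("hidden sizes:", [w.shape[1] for w in mlp.coefs_[:-1]])
print("output units:", mlp.coefs_[-1].shape[1])  # 1 for binary classification
```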
5. Comparison with Related State-of-the-Art Work
As far as a results comparison is concerned, the following
Table 15 sums up a comparison between two different works and the study presented in this paper:
As one can easily observe, the results presented in this paper demonstrate higher accuracy compared to similar approaches when Logistic Regression, KNN and Decision Trees algorithms are exploited. Nevertheless, lower accuracy is observed in terms of the SVM algorithm and Random Forest results.
The superiority of the current work relative to other research papers lies in the fact that, despite using smaller public repositories, most of the accuracy results surpass the current year's (2024) SOTA results, as can easily be seen in Table 15. The smaller repositories were a necessity: public repositories of fake accounts are difficult to use due to new laws regarding the use of private data.
6. Discussion
In the study presented in this paper, the goal was to identify fake accounts on Instagram with the help of machine learning algorithms. Given the general difficulty of finding such accounts, owing to the strict new legal rules on data processing, we decided to follow an alternative approach. In contrast with previous research papers, which rely on a substantial number of features, the work presented here was based on a reduced number of publicly available features per user. This decision was taken with future exploitation of the current work in mind, since even today it is necessary to rely on a limited amount of data in order to evaluate, with a high degree of success, a user's identity within the enormous pool of a social network. The effort described above thus addresses, to an essential degree, the current (2024) research gap in the detection of fake Instagram accounts via machine learning techniques.
The implications of feature selection on detection accuracy are substantial, as demonstrated in recent research on fake account detection. Careful feature engineering allows for a more accurate and computationally efficient model by isolating the attributes that best distinguish fake accounts. For instance, introducing ratios such as the following/followers ratio, the following/posts ratio, and the followers/posts ratio significantly improved classification accuracy by capturing behavioral patterns unique to fake profiles. In the preliminary experiments, each additional feature selection increased the model’s accuracy incrementally by up to 7%, highlighting the effectiveness of selective feature inclusion. Additionally, a correlation matrix analysis revealed that attributes like bio length and username digit count, which initially seemed relevant, had weaker associations with the “IsFake” label, and they were ultimately excluded to reduce noise. This systematic approach to feature selection not only streamlined the model, but also amplified its robustness, underscoring the critical role of carefully chosen features in achieving optimal detection accuracy for machine learning models in social media applications.
7. Conclusions and Future Work
In this section, the results of this article are summarized, along with suggestions, ideas and thoughts for future work related to the detection of fake accounts, both on Instagram and on other social networks. In this paper, various machine learning approaches were employed to detect fake Instagram accounts. From its appearance in 2010, Instagram put great emphasis on utilizing pictures and videos in the best possible way in order to attract users. As time passed, the rapid development of technology and changes in society resulted in more users becoming active on Instagram: it provided a new means of communication, snapshots and the ability to publicize moments of everyday life, and it can be applied in the commercial sector to target millions of users. However, some users exploit this means of communication for illegal purposes. Fraud, bullying, misinformation, influencing public opinion, and the promotion of third parties are some of the reasons why such accounts are created.
Seven machine learning algorithms (Gaussian Naïve Bayes, Random Forest, Decision Trees, Logistic Regression, MLP, KNN and SVM) were used in order to recognize such kinds of accounts with high accuracy. In order to achieve this, publicly available data were collected and were gradually enriched with more features (characteristics). These features comprised combinations of already existing features, and it was proved in the final results that they played a decisive role in improving the metrics (accuracy, precision, recall, F1-Score, AUC-ROC, log-loss).
An overview of the main results elaborated upon in this paper is presented below:
Algorithms with similar internal operation (Random Forest, Decision Trees) responded similarly across the phases to the additionally proposed features.
Overall, the proposed features significantly increased the values of our metrics. This shows that results can be improved by exploiting the features a dataset already has, without having to incorporate less accessible ones. The same result can be achieved with few but high-quality features, saving time during the models' training.
Some models (Gaussian Naïve Bayes, Logistic Regression) did not improve in the fourth and last phase, showing that choosing many features per input is not an ideal solution to every problem. Other models (Random Forest, MLP, KNN) obtained better results thanks to their different internal structures, showing that the decision not to leave out the last additional feature was correct, even though it demonstrated near-zero correlation with account authenticity.
The Random Forest model seems to be the best solution for detecting fake accounts. With the exception of precision, it excelled in all other metrics, reaching a classification accuracy of nearly 98% (97.71%). The rest of the metrics likewise reached high levels, with the F1-Score reaching 95.74%. These values were obtained during the fourth phase, with all features available to the algorithm.
The authors aim to evolve the presented work in several directions. Indications of their respective future plans are listed below:
The results presented in this paper can be used to produce frameworks and applications for Instagram users, so that any such user can check the authenticity of the accounts they interact with on Instagram.
It could be possible to implement specific results or models in the context of similar studies on less popular social networks where the available data will also be small in number, so that such malicious actions could be detected/prevented.
Another direction could be identifying fake accounts on websites and review platforms (restaurants, hotels, etc.), where fake accounts also appear and behave in the same way as on social networks.
A final option could be applying the same algorithms in an unsupervised learning setting to solve the same problem: large volumes of data exist, but labeling them is a very demanding procedure that consumes time and labor hours.
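One possible shape of that unsupervised direction, sketched below on synthetic data, is to treat fake accounts as anomalies with an Isolation Forest, which needs no labels at all. The method and the feature values are illustrative assumptions, not results from this paper.

```python
# Sketch: label-free anomaly detection as a stand-in for the unsupervised
# direction mentioned in the text. Data is synthetic.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
# Synthetic "real" accounts: many followers/posts; "fake": sparse activity.
real = rng.normal(loc=[300, 120], scale=[80, 30], size=(180, 2))
fake = rng.normal(loc=[10, 2], scale=[5, 1], size=(20, 2))
X = np.vstack([real, fake])

# contamination is the assumed share of anomalous (fake-like) accounts.
iso = IsolationForest(contamination=0.1, random_state=0).fit(X)
flags = iso.predict(X)  # -1 = flagged as anomalous, 1 = normal
print("accounts flagged as anomalous:", int((flags == -1).sum()))
```

The flagged accounts could then be reviewed manually, turning an expensive full-labeling task into a much smaller verification one.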
In summary, this research paper focused on the detection of fake Instagram accounts, proposing solutions to protect users from such malicious entities. Specifically, via the use of machine learning algorithms, solutions were proposed that can be exploited in technological implementations, thus developing a prevention mechanism against such behaviors.
To achieve this, datasets of public Instagram users were collected from the internet and then processed, with features added or removed, before being fed into the machine learning models. A key finding is that the research work presented in this paper stands comparison with similar studies, with the distinction that we achieved comparable, if not superior, results using smaller datasets. While smaller datasets typically complicate the extraction of robust conclusions, our research successfully navigated this challenge. It is worth highlighting that the results obtained are compared mainly with similar approaches that use more features. Last but not least, it should be noted that the proposed approach outperforms existing fake account detection approaches by coupling six well-established, carefully selected features with three newly introduced ones, which overall results in higher accuracy.