1. Introduction
Software Fault Forecasting (SFF) is a methodology for enhancing software quality and minimizing software testing expenses by building classification models with diverse machine learning techniques. Numerous software development companies aim to anticipate issues to uphold software quality for customer satisfaction and to economize on testing expenses. Using Deep Learning (DL) methods with historical data is a common practice for predicting suspicious code blocks in the software development life cycle. A structured development method facilitates the delivery of software of superior quality while keeping costs low and meeting customer expectations within a short timeframe [
1]. Federated Learning (FL) has emerged as a promising approach to overcome data isolation by enabling distinct sites to train a global model cooperatively rather than directly sharing data for training [
2].
The mission of SFF is to ensure the provision of software of superior quality and reliability while fine-tuning the utilization of scarce resources. Consequently, software developers can prioritize the usage of computing resources at every stage of the software development lifecycle. Numerous organizations involved in software development seek to anticipate software defects to preserve software quality, enhance customer satisfaction, and reduce testing expenses. Adherence to a disciplined software development process is aimed at improving the quality of software [
3]. Diverse classification models can be constructed using various machine intelligence techniques to conduct testing more efficiently. Different machine intelligence techniques have been explored to predict erroneous code statements in software modules to improve software quality and minimize software testing expenses [
4]. The quantity and length of assessments administered significantly influence the efficacy of tests. Insufficient testing of software may result in the persistence of undetected faults, which could subsequently manifest as undesirable behaviors in the future. Excessive software testing can result in project delays and budget overruns due to unforeseen expenses. Early detection and rectification of software defects can reduce project costs and mitigate the risk of project duration overruns. Software developers can determine the software’s fault susceptibility by analyzing its code during the initial phases of development through the proficient utilization of software metrics [
5].
Software enterprises are endeavoring to develop software modules that are devoid of errors. Inherent defects compromise a software product’s efficacy, rendering it incapable of executing tasks precisely and effectively. The identification of software defects is considered the most crucial phase of SFF and requires extensive testing [
6]. This stage is of utmost importance as it ensures the quality and reliability of the software product. Effective defect management enhances the quality of software solutions and fosters a culture of quality consciousness throughout the project life cycle, leading to a sustained enhancement of deliverables. Identifying and fixing any software problems as soon as possible is essential to enhancing the program’s dependability and practicality [
7]. The Federated Learning-based Software Fault Forecasting model is shown in
Figure 1.
In federated learning, the global model is built collectively from the distributed local models, and the local models are built on the locally available instances. In this process, each device develops an individual local model by exclusively utilizing its local dataset. The local models are transmitted to a central coordinator and subsequently consolidated into a global model, which is then redistributed to all participants for either inference or additional training purposes. The primary objective of FL is to facilitate collaborative efforts among participants, leading to the development of a superior model compared to individual efforts while ensuring the preservation of data privacy. This is accomplished by mandating that participants exchange model parameters rather than data. The main contributions of the current study are listed below.
This study identifies features based on the Gaussian probability density function that would assist in accurately identifying the features across various programs independently based on the distribution to recognize potentially vulnerable statements within a software program.
The features are assigned feature weights that are significant in deciding the feature contribution in the classification procedure, and the weights are updated over the training rounds by evaluating the test cases.
The Spider Monkey Optimization algorithm is used in updating the statement ranks for every individual code from a global perspective, resulting in a precise ranking of the statements.
The global feature weights are updated from the weights in the local model in the federated learning setting to build a robust fault prediction model.
The statements are assigned vulnerability scores according to their overall ranking among the program’s code statements, updated over the training process, and the ranks would assist in localizing the suspicious code blocks.
Statistical analysis of the proposed federated learning-based software fault forecasting model against other conventional approaches concerning standard metrics like Sensitivity, Specificity, Accuracy, F1-score, and the ROC curves.
The rest of the manuscript is organized as follows:
Section 2 presents the literature on existing software fault forecasting models and federated learning models.
Section 3 presents this study’s background, discussing feature selection, scaling, dataset description, and the implementation environment.
Section 4 presents the proposed model, discussing the fine-tuned spectrum-based fault localization mechanism and the weight updating mechanism in federated learning.
Section 5 presents the experimental results and discussion. Finally,
Section 6 offers this study’s conclusion and future research directions.
2. Literature
Many studies have previously been conducted on fault identification and prediction using conventional machine learning and DL approaches. These approaches target bug detection in code in various ways: some focus on the vulnerable lines of code and differentiate them with classification techniques, while others rely on classification features, run-time profiling, program log reports, breakpoints, and similar artifacts. Nevertheless, there is a demand for a framework that could forecast the suspicious code block well in advance using run-time profiling and semantic features, which would assist in precisely classifying the suspicious portion of the software. A popular technique for defect prediction involves utilizing a classification algorithm to partition the source code into two distinct categories, namely, defective and error-free code [
8]. Despite this, methodologies reliant on manually constructed characteristics frequently fail to encapsulate the semantics and syntax of the code block adequately. Conventional code metrics lack the ability to differentiate between code fragments that possess identical structure and complexity yet execute distinct functionalities. Rawat and Dubey [
9] have conducted research that presents several models aimed at enhancing software quality. Their study involves an analysis of the factors that impact software quality as well as strategies for improving the product and the overall performance of software. The study examined several metrics related to size and complexity and various computational models, including Bayesian belief networks, genetic algorithms, and neural networks, among other options.
Challagulla et al. [
10] have analyzed the impact of ML and statistical models in assessing software quality. Experiments were conducted on four separate real-time software fault repositories using various prediction approaches. The results showed that the rule-based classification approach using the 1R technique and instance-based learning, when combined with the consistency-based subset assessment mechanism, outperformed other models in terms of precision. Based on their outcome, the authors demonstrated a comprehensive software defect analysis tool for analyzing flaws and monitoring software modules in real time. Divya Tomar and Sonali Agarwal [
11] have presented a study on the prediction of faulty software modules based on class-imbalanced learning that is efficient in dealing with imbalanced datasets. The issue with learning from imbalanced data is that the minority class is not afforded the same level of attention by the learning approach as the majority class. In the context of imbalanced datasets, the learning algorithm either produces overly specific classification rules for the atypical class or misses such rules altogether. The lack of generalizability of these rules to novel data renders them unsuitable for predictive purposes.
Wu et al. [
12] have introduced a semi-supervised technique for dictionary learning in their study, which involved utilizing both labeled and unlabeled defect datasets over 16 different projects. Moreover, their approach considered the expenses incurred due to classification errors while performing dictionary learning. However, the dictionary-based process needs high-quality data for training. It is also imperative to note that this aspect is contingent on the context and cannot be universally applied, necessitating a case-by-case examination. The process entails an additional computational burden to identify the most salient features, which may be hampered by the existence of correlated or insignificant features. Yang et al. [
13] have presented a generative model based on Deep Belief Networks (DBNs), which rely on a neural network that operates across multiple levels. This architecture enables the DBN to learn and represent complex patterns in the data it is trained on. The architecture of this network comprises a singular input layer, a singular output layer, and a multitude of hidden layers. The output layer generates the feature vector, which represents the input data. Each stratum comprises stochastic nodes. A crucial characteristic of the DBN is its restricted connectivity, whereby nodes are exclusively linked to nodes in adjacent layers, not those within the same layer. The primary limitation of the DBN lies in its inadequate ability to effectively encapsulate the contextual information of code elements, including but not limited to the sequential execution of statements and calling upon functions.
Numerous studies based on the Convolutional Neural Network (CNN) specialize in multiple convolution filters for data processing, and this network is defined by two important characteristics. First, the local unit connection pattern is duplicated throughout the network, which allows it to capture the short-term structural context of the source code. Second, all units share the same parameters, so the network may learn code element information regardless of its location. Zhou Xu et al. [
14] investigated defect detection using CNN over triplet loss and weighted cross-entropy loss approaches. Another work, by Qiu et al. [
15], uses the CNN model to provide a feature-learning approach. The model is intended to choose characteristics from token vectors in the Abstract Syntax Tree (AST) of the code. It then goes on to learn transferable joint characteristics. Integrating deep-learning-generated characteristics with hand-crafted ones allows the technique to effectively conduct cross-project fault prediction. The fundamental disadvantage of these models is that they grow increasingly complicated as the dataset size increases, and they must be trained for a substantial number of epochs to recognize the features with good accuracy.
Mcmurray and Sodhro [
16] experimented with a defect detection mechanism for security-related traceability in Smart Healthcare Applications, and the model performed reasonably well across divergent machine learning techniques such as Principal Component Analysis (PCA), Partial Least Squares Regression (PLS), and Feature Selection. Srinivasa Kumar et al. [
17] conducted a study on software fault detection and recovery for business operations using an independent program that examines failures and restores normal functionality using test cases in a more conventional manner. Batool and Khan [
18] have proposed a software fault detection model using deep learning models based on Long Short-Term Memory (LSTM) and Bi-directional Long Short-Term Memory (Bi-LSTM) [
19] and a Radial Basis Function Network (RBFN) for fault prediction. The performances of the deep learning models were compared with other conventional approaches, and the experimental results showed that LSTM and Bi-LSTM yielded better accuracies of 93.66% and 93.45%, respectively, while the RBFN yielded 82.18% accuracy but is reasonably faster than the two deep learning models. In the study on software fault prediction by Borandag [
20], a deep learning model based on Recurrent Neural Networks (RNNs) and ensemble techniques learns across divergent datasets; the proposed model yields an accuracy of 95.9% on one of the datasets considered in the experiment.
The studies discussed above are just a few of the approaches used in software fault localization. All of them are local models: they are implemented locally over a single software project, and each model can classify only the suspicious code blocks it is trained for. A global model that can handle divergent software projects with distinct error classes is therefore needed. In the current study, a federated learning-based model is designed to address this requirement for a unified global model that deals with divergent error classes.
3. Background
The current section of the manuscript deals with the preliminaries of the proposed fine-tuned spectrum-based fault localization technique, which includes information about the feature selection and scaling mechanisms, information extraction, dataset description, and implementation environment, followed by the implementation of federated learning.
3.1. Feature Selection and Scaling
This study uses the Gaussian Probability Density Function (GPDF) [
21] in feature selection. The Gaussian distribution is a prevalent form of continuous probability distribution. Gaussian distributions are statistically significant and frequently employed in feature analysis to depict random variables with real values. GPDF is widely used because it is the probability density function that emerges as a limit for the sum of random variables. It has been observed that, regardless of the probability density function of the individual variables, the probability density function of a combination of random variables that are independent resembles a Gaussian distribution as the total number of variables being summed increases [
22]. The mathematical formulation of GPDF is shown in Equation (1).
From the above equation, the notation σ² designates the distribution variance, which denotes the degree of concentration of the GPDF, and μ designates the distribution mean over the input variable x. The probability distribution function across the range of the interval [a, b] is shown in Equation (2).
The bound a designates the lower limit of the GPDF’s probability range, and b denotes the upper boundary of the probability range. Hence, it is plausible to consider the probability of x as an approximation of the integral from a to b, which can be evaluated using Equation (3).
The assessed value would be the initial feature probabilistic values assigned to the features, and their corresponding values are consistently updated over the iterations. The feature scaling technique is used to standardize the range of values. The process of feature scaling is employed to normalize the range of features present in the input data set. The input program’s feature set encompasses diverse values during the learning phase, with a simultaneous reduction in the loss function. The scaling process is executed iteratively to expedite and accurately achieve the global or local optimum in the localization algorithm. The present investigation involves utilizing Min–Max normalization to scale the feature values within the range of 0–1. The Min–Max normalization technique offers several advantages over conventional scaling methods. Min–Max scaling can effectively manage feature distributions that deviate from the Gaussian distribution. The Min–Max normalization technique addresses the issue of precision loss in a gradient optimization method that aims to converge toward the global solution [
23]. The process generates target values within the range of 0 to 1 through the utilization of the minimum and maximum values of the column. The corresponding mathematical formula is shown in Equation (4). The new feature weight value is considered the feature’s current value, as shown in Equation (5), for further processing.
The variable x′ represents the newly normalized value between 0 and 1, x_min corresponds to the lowest value of the feature, x_max represents the highest value of the same feature, and x represents the data sample being considered.
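A minimal sketch of the two steps described above, assuming a per-feature Gaussian fit and a fixed integration half-width (both illustrative choices not specified in the text); the function and variable names are hypothetical:

```python
import numpy as np

def gaussian_pdf(x, mu, sigma):
    """Gaussian probability density (Equation (1)) for feature value(s) x."""
    return np.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))

def initial_feature_probabilities(feature_column, half_width=0.5):
    """Approximate per-value probabilities (Equation (3)) by integrating the
    fitted Gaussian over a small interval around each observed value."""
    mu, sigma = feature_column.mean(), feature_column.std() + 1e-12
    # Midpoint approximation of the integral over [x - half_width, x + half_width]
    return gaussian_pdf(feature_column, mu, sigma) * (2 * half_width)

def min_max_scale(feature_column):
    """Min-Max normalization (Equation (4)): map values to the range [0, 1]."""
    lo, hi = feature_column.min(), feature_column.max()
    return (feature_column - lo) / (hi - lo + 1e-12)

# Example: derive initial feature weights for one software metric column
loc_metric = np.array([12.0, 45.0, 7.0, 120.0, 33.0])   # hypothetical lines-of-code values
weights = min_max_scale(initial_feature_probabilities(loc_metric))
```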
3.2. Information Extraction
The direct extraction of textual information from code content may pose a challenge in retrieving crucial information due to the complexity of the code file. The text contained significant erroneous data, including comments and function descriptions. The token sequence exhibited concise information and lucid content, facilitating mapping to the code content. Thus, the retrieval of information from code text can be accomplished by utilizing token sequences. The attention module is a technique used for mining key features in the text. It can automatically recognize the significant features within the text data. The use of the attention mechanism has gained significant traction in natural language processing. Thus, the utilization of the attention module may facilitate the task of extracting textual information from the token sequence. The procedure by which the attention module extracts textual information from a set of queries Q using a set of keys K over the values V is shown in Equation (6) [24].
Initially, the tokens were mapped onto a space with high dimensions. A key-value pair was created and utilized as an input parameter to the attention model to represent the vector’s value in a token. The attention score was calculated based on the degree of similarity between the query and the key. Using the Softmax function, the attention module generates a vector representation of the textual information in the code.
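The description above matches the standard scaled dot-product attention; the following NumPy sketch illustrates that assumed form of Equation (6) with randomly generated token embeddings (dimensions are arbitrary):

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)      # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: score queries against keys,
    softmax-normalize the scores, and use them to weight the values."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)              # query-key similarity
    return softmax(scores, axis=-1) @ V          # weighted sum of token values

# Example: 4 token embeddings of dimension 8 attending over themselves
tokens = np.random.randn(4, 8)
text_representation = attention(tokens, tokens, tokens)   # shape (4, 8)
```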
3.3. Dataset Description
Multiple datasets are being considered in the current study to evaluate the performance of the federated learning model. Each local model is designated to handle a particular dataset type, and the global model combines multiple local models. The local datasets include the NASA-Metrics Data Program (MDP) repository [
25,
26], which has CM1, MW1, PC1, PC3, and PC4 instances. The PROMISE repository, which provides KC1 and JM1 instances, and the Unix Utility Programs (UUP) repository, with gzip, sed, flex, and grep instances, are also considered in the current study for evaluation.
The fused dataset in the NASA repository comprises 3579 instances and 38 attributes in total. Each of the datasets that have been chosen represents a distinct software component, and the instances contained within the dataset are indicative of the various software modules. The features denote the software metrics documented throughout the development process. The fused dataset comprises 38 features, with one feature designated as the output class for prediction and the remaining 37 features utilized in the prediction process. The output classification determines the existence or absence of defects in the module under consideration. The PROMISE repository consists of 9793 instances, of which 9593 are JM1 instances and 200 are KC1 instances; 1759 of the JM1 instances and 36 of the KC1 instances are defective [
27].
The gzip utility accepts a set of 13 distinct parameters as input, in addition to a roster of files to be compressed. The software exhibits significant functionality, as evidenced by its 6573 lines of code and 211 test inputs. The Sed utility is employed to perform minor alterations to an input sequence. The primary application of this tool is to analyze textual input and implement alterations to the information as directed by the user. The program comprises 12,062 lines of code and encompasses 360 test inputs. The function of the flex program is to perform lexical analysis. The input files were generated from regular expressions and C code rules. The total number of lines of code amounts to 13,892, while the number of test inputs provided is 525. The grep command accepts two input parameters: patterns and files. Lines from any file that match any of the given patterns are printed by the program. Lines of code are used to quantify the amount of code produced in a software program; the grep program comprises 12,653 lines of code and 470 test inputs. The summarized information on instances associated with various software fault repositories is shown in
Table 1.
3.4. Implementation Environment
The test cases evaluate the reliability of the code excerpt across various parameters, such as disparate inputs and operational circumstances. The assessments are conducted locally using dedicated software deployed on a standalone computer.
Table 2 presents the specifics of the experimental setting in which the experimentation is conducted.
3.5. Implementation of Federated Learning
The utilized technologies augment a self-contained, microservice-based, and fortified infrastructure for operational settings that necessitate the implementation of federated machine learning solutions. The technologies mentioned above facilitate the aggregation of readily available services and third-party libraries, collectively constituting the stack of open-source tools that underpin the platform. Docker has been chosen as the primary tool for managing images and containers, serving as the initial level of abstraction that impacts all platform modules. Virtualization enables secure resource management through the implementation of hardware-agnostic and isolated execution. In addition to ensuring the secure implementation of federated tasks, it safeguards the host from user-generated code.
The implementation-level federated execution layer is built upon the Flower library, which offers robust functionalities that ensure efficient processing of computing modules without requiring specialized libraries for algorithm production. How communication is conducted varies based on the intended purpose and content. If model parameters are being communicated, the gRPC protocol (implemented via Flower) is utilized, significantly improving the (de)serialization phase. Without other requirements, RESTful actions are deemed sufficient to effectuate modifications to the state of nodes, whether through user-to-server or server-to-server interactions. In addition, a web-based Graphical User Interface (GUI) is incorporated into the system, which is implemented separately from the core API using Jinja templates. The API Gateway, Kong, consolidates various internal paths into a single port, enhancing the system’s usability and facilitating the exposure of external endpoints. The Kong proxy process is capable of handling gRPC and HTTP/1 protocols.
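A minimal Flower client/server sketch of the setup described above; the local model object and its methods (get_weights, set_weights, train_one_round, evaluate, num_examples) are hypothetical placeholders, and the address, port, and number of rounds are illustrative:

```python
import flwr as fl

class FaultForecastClient(fl.client.NumPyClient):
    """Hypothetical client holding one local fault dataset (e.g., PC1)."""

    def __init__(self, model):
        self.model = model                              # placeholder local model

    def get_parameters(self, config):
        return self.model.get_weights()                 # list of NumPy arrays

    def fit(self, parameters, config):
        self.model.set_weights(parameters)              # load global weights
        self.model.train_one_round()                    # one local training round
        return self.model.get_weights(), self.model.num_examples, {}

    def evaluate(self, parameters, config):
        self.model.set_weights(parameters)
        loss, acc = self.model.evaluate()
        return loss, self.model.num_examples, {"accuracy": acc}

if __name__ == "__main__":
    # Server side: FedAvg aggregation over gRPC, as provided by Flower
    fl.server.start_server(
        server_address="0.0.0.0:8080",
        config=fl.server.ServerConfig(num_rounds=5),
        strategy=fl.server.strategy.FedAvg(),
    )
    # Each participant would connect (in its own process) with, e.g.:
    # fl.client.start_numpy_client(server_address="<server>:8080",
    #                              client=FaultForecastClient(local_model))
```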
4. Proposed Methodology
The current section of the manuscript presents the Spider Monkey Optimization-based fine-tuned Spectrum-Based Fault Localization, which assigns ranks to suspicious code blocks in fault forecasting. It also discusses the weight update mechanism in federated learning.
4.1. Fine-Tuned Spectrum-Based Fault Localization
The conventional spectrum-based fault localization is fine-tuned using the Spider Monkey Optimization algorithm for a better-refined ranking model. The more precisely the ranks are assigned to the code blocks, the better the accuracy of the fault forecasting model. The spectrum-based fault localization model employs a method-specific mathematical formula to assign a potential vulnerability ranking, and hence a possible fault, to each trace of program components, such as expressions, code statements, initialization statements, assignment statements, branching statements, and evaluation statements, gathered for each test case. The suspiciousness rank is a metric used to assess the probability of a statement or a code snippet within a software program being faulty. By employing a spectrum-based approach for fault localization, the dependency information of each code snippet is scrutinized during the execution of test cases. The suspicious code block assessment estimates suspicious ranks for each program element by integrating correlation and dependency information [
28]. The fine-tuned spectrum-based fault localization framework is shown in
Figure 2.
The frequency of successful and unsuccessful test case executions, relative to the number of executions, determines the ranking system. Let us assume that C identifies the code, consisting of a set of code blocks with elements c₁, c₂, …, cₙ, where it is assumed that each cᵢ ∈ C. Equation (7) presents the formula utilized for rank assessment. The probabilistic feature values are assessed during training. The rank of a code block is assessed based on the number of test cases that the particular block of code passes, while the total number of failed test cases associated with the block is recorded separately.
From the above equation, one term denotes the number of failed test cases associated with the code block, and the other denotes the number of test cases that pass. The obtained rank is normalized using the Spider Monkey Optimization algorithm. In conventional studies, code blocks are ranked based on the test cases of the program in which they are included; such a local ranking would not yield good results in a federated learning environment, because similar test cases and code statements are encountered in more than one program. The statements’ ranks are therefore refined through the SMO algorithm to maintain a global ranking mechanism.
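Since Equation (7) is not reproduced here, the sketch below uses the well-known Ochiai formula as an illustrative stand-in for a pass/fail-count-based suspiciousness score; the coverage figures are made up:

```python
import math

def suspiciousness(failed_cov, passed_cov, total_failed):
    """Ochiai-style suspiciousness (stand-in for Equation (7)):
    failed_cov / passed_cov = failing / passing test cases that executed
    the code block; total_failed = overall number of failing test cases."""
    denom = math.sqrt(total_failed * (failed_cov + passed_cov))
    return failed_cov / denom if denom else 0.0

# coverage[block] = (failing tests that hit it, passing tests that hit it)
coverage = {"s1": (4, 1), "s2": (1, 9), "s3": (4, 4)}
total_failed = 4
ranked = sorted(coverage, key=lambda s: suspiciousness(*coverage[s], total_failed),
                reverse=True)
print(ranked)   # code blocks ordered from most to least suspicious
```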
The identification of updated ranks for the subsequent phase of fault forecasting is accomplished through the utilization of Spider Monkey Optimization, which considers the local and global best solutions. The fitness of the search space at the outset is established through a random selection of the individuals that initially make up the population. The formula presented in Equation (8) assigns the updated rank values.
From the above equation, the perturbation rate controls the statement-ranking update, one term corresponds to the local best rank associated with the code block, and another corresponds to the global best rank. Two normalization terms are used to normalize the local best and global best ranks, as shown in Equations (9) and (10). The function rand() is utilized to generate a uniformly distributed random value within the interval of 0 and 1. The updated ranks account for both the local best rank and the global best rank and assist in amending the ranks of the code blocks precisely, based on the overall grading, which eases the forecasting of erroneous code blocks in a normalized manner.
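The exact form of Equations (8)-(10) is not reproduced in the extracted text; the sketch below therefore assumes the standard Spider Monkey local-leader-style update, where a rank is nudged toward the local best and global best ranks with a probability governed by the perturbation rate:

```python
import random

def refine_rank(rank, local_best, global_best, perturbation_rate=0.3):
    """Illustrative SMO-style rank update (cf. Equation (8)): move the current
    rank toward the local-best and global-best ranks; rand() values are
    drawn uniformly from the stated intervals."""
    if random.random() >= perturbation_rate:             # update with this probability
        rank = (rank
                + random.uniform(0, 1) * (local_best - rank)
                + random.uniform(-1, 1) * (global_best - rank))
    return rank

# Example: refine three statement ranks in one training round
local_best, global_best = 0.82, 0.91
ranks = [0.40, 0.75, 0.55]
refined = [refine_rank(r, local_best, global_best) for r in ranks]
```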
4.2. Feature Weight Upgradation at Federated Server
This is the crucial phase of the proposed FEDRak-based approaches for software fault forecasting. The feature weights are updated using federated learning rather than sending the data over the internet. The updated feature weights would be fed as input to the server to upgrade the global model [
29]. For ease of understanding, the weight of a feature is identified by W and is associated with two different approaches for two different clients. The weights are designated as W⁽¹⁾ and W⁽²⁾, which correspond to two clients C₁ and C₂ that run two different algorithms A₁ and A₂ used in software fault forecasting. Equations (11) and (12) present the weight matrices associated with both algorithms.
In the above equations, the entry in the first row and last column, i.e., the n-th column, of W⁽¹⁾ designates the corresponding feature weight for the first algorithm. Similarly, the entry in the last row, i.e., the m-th row, and first column of W⁽¹⁾ denotes the weight at that position of the matrix generated by the first algorithm, and the entry in the last row and last column of W⁽¹⁾ represents the weight in the first algorithm’s final position. The analogous entries of W⁽²⁾ denote the feature weights in the first row and last column (the n-th column), the last row (the m-th row) and first column, and the last row and last column for the second algorithm. The ideal weights for the input to the processing layer in a federated server context are denoted by W*, which is a combination of both algorithms, i.e., A₁ and A₂, and the corresponding formula for the ideal weights is shown in Equation (13).
In the case where only a single client, denoted by C₁, is associated, the feature weight matrix over an algorithm A₁ is shown in Equation (14).
The present aggregation encounters a challenge with the addition property of matrices, as matrix addition requires consistent dimensions. The preceding equations indicate that directly adding the locally trained matrices is unfeasible due to their disparate dimensions. To address this matter, the dimensions of all relevant matrices must be made identical. To accomplish this, it is necessary to concatenate a matrix of zeros with each relevant matrix [30]. The objective is to determine the largest number of rows across all locally trained clients, as shown in Equation (15), and the largest number of columns, as shown in Equation (16), over the two distinct matrix dimensions.
The process of embedding the zero matrices alongside each optimal weight matrix will be executed using the subsequent methodology. The matrices mentioned above with zero values will be subjected to horizontal concatenation with the weight of each model trained locally, as shown in Equations (17) and (18) [
31]. The weight updating mechanism can be depicted in
Figure 3.
The horizontal concatenation of the matrices is illustrated in Equations (19) and (20).
The result of Equation (21) will be utilized to acquire a global model, assuming both matrices are of identical size.
From the above equations, one notation is used to update the weights when the clients’ matrices have similar dimensions, and the other denotes the weights when the matrices have divergent dimensions.
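A NumPy sketch of the aggregation step described above, assuming zero-padding on both axes up to the maximum row and column counts and an unweighted mean over clients (the weighting scheme is not specified in the extracted text):

```python
import numpy as np

def pad_to(w, rows, cols):
    """Embed a locally trained weight matrix into a zero matrix so that all
    client matrices share the maximum dimensions (cf. Equations (15)-(18))."""
    padded = np.zeros((rows, cols))
    padded[:w.shape[0], :w.shape[1]] = w
    return padded

def aggregate(client_weights):
    """Average the padded client matrices to obtain the global weights
    (cf. Equations (13) and (21)); an unweighted mean is assumed here."""
    rows = max(w.shape[0] for w in client_weights)
    cols = max(w.shape[1] for w in client_weights)
    return np.mean([pad_to(w, rows, cols) for w in client_weights], axis=0)

# Two clients with differently sized feature-weight matrices
w1 = np.random.rand(3, 5)              # client C1, algorithm A1
w2 = np.random.rand(4, 4)              # client C2, algorithm A2
global_weights = aggregate([w1, w2])   # shape (4, 5)
```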
5. Results and Discussion
To evaluate the effectiveness of a fault forecasting methodology, it is imperative to employ suitable metrics like sensitivity, specificity, F1-score, accuracy measures [
32], and ROC curves. A statistical analysis of the performance of the proposed software fault forecasting model is performed against other contemporary techniques used in other studies, like Naïve Bayes (NB), Artificial Neural Network (ANN), Decision Tree (DT), Random Forest (RF), Fuzzy Logic-Fused (FLF), Bayesian Network (BN), Support Vector Machine (SVM), Synthetic Minority Oversampling Technique (SMOTE), Heterogeneous Ensemble Classifier (HEC) (a combination of ANN, DT, BN, and SVM), Bayesian Regularization (BR), Scaled Conjugate Gradient (SCG), BFGS Quasi-Newton (QN), Levenberg–Marquardt (LM), Variant-based Ensemble Learning (VEL), the Artificial Bee Colony (ABC) optimization technique, Principal Component Analysis (PCA), and the Principal Component-based Support Vector Machine (PC-SVM). In a few studies, the accuracies are analyzed for each program category; to make the analysis more fruitful, the mean of the accuracies for each class is considered for a more reliable estimation of the performances. The model’s performances are assessed for the standalone implementation, identified by FEDRak (L), and the federated learning-based model, identified by FEDRak (G).
The performances of the proposed model are assessed concerning the standard evaluation parameters: True Positive (TP), True Negative (TN), False Positive (FP), and False Negative (FN) [33]. The number of times the proposed model precisely identifies an erroneous code block is counted as a true positive, and the number of instances correctly identifying error-free blocks as non-suspicious is counted as a true negative. The number of cases in which the proposed model misinterprets non-erroneous code blocks as erroneous ones is counted as a false positive, and the number of instances in which the model misinterprets erroneous code blocks as non-erroneous ones is counted as a false negative [34]. The sensitivity determines the ability of the model to correctly identify the erroneous code blocks, and the corresponding formula is shown in Equation (22).
The specificity determines the ability to correctly recognize non-erroneous code blocks. The corresponding formula is shown in Equation (23).
Accuracy is the other significant performance assessment metric, which summarizes the total number of accurately detected instances. It is most often used when all classes of the dataset are equally essential. The corresponding formula for accuracy is shown in Equation (24).
The F1-score evaluates a model’s prediction capacity by focusing on its class-wise efficiency instead of its overall performance. The F1-score is the harmonic mean of precision and sensitivity. The corresponding formula for the F1-score is shown in Equation (25), and the formula for precision is shown in Equation (26).
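Since the equation bodies are not reproduced above, the standard definitions assumed for Equations (22)-(26) are:

```latex
\begin{align}
\text{Sensitivity} &= \frac{TP}{TP + FN} \\
\text{Specificity} &= \frac{TN}{TN + FP} \\
\text{Accuracy}    &= \frac{TP + TN}{TP + TN + FP + FN} \\
\text{F1-score}    &= \frac{2 \cdot \text{Precision} \cdot \text{Sensitivity}}{\text{Precision} + \text{Sensitivity}} \\
\text{Precision}   &= \frac{TP}{TP + FP}
\end{align}
```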
The performance of the proposed model is analyzed across each category of the programs, in both local and global models. In some of the existing studies, the models are not evaluated concerning some of the metrics. All such instances are mentioned as not applicable (N/A) in the performance analysis tables. The obtained outcome for each category and the corresponding values are shown in
Table 3,
Table 4,
Table 5,
Table 6,
Table 7,
Table 8,
Table 9,
Table 10 and
Table 11. The mean performances of local and global models in each program category, along with Unix Utility programs, are shown in
Table 12.
The above tables show the proposed model’s performance on individual programs and its overall performance when executed in a standalone environment and in a federated learning environment. It can be seen from the experimental results that standalone learning exhibits better performance than federated learning. Performance in federated learning is comparatively lower because features common to several programs receive different feature ranks across divergent program categories, which compromises the accuracy of the global model. The performances for each program class are evaluated against the existing models, and the performances of each program category are evaluated as shown in
Table 13. PC4 of the NASA-MDP repository has attained the highest accuracy of 97.9%, and Sed of the Unix Utility program has the least accuracy of 89%. The mean accuracy for all categories of programs is 93.7%. The graphs representing the resultant outcome for all programs are shown in
Figure 4.
The performance of the proposed software fault forecasting model is further evaluated using the Receiver Operating Characteristic (ROC) curve [38]. The ROC curve is especially useful with binary classes because it shows the trade-off between the true positive rate and the false positive rate. The ROC curve is generated by computing and graphing the rate of true positives relative to the rate of false positives for a single classifier across multiple thresholds. The formula for the ROC curve is shown in Equation (27).
From the above equation, the notation P designates the number of positive samples, and the notation N designates the number of negative samples in the entire dataset. The corresponding ROC curves for NASA-MDP programs are shown in
Figure 5, the ROC curves for PROMISE programs are shown in
Figure 6, and the ROC curves for Unix Utility programs are shown in
Figure 7.
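A short scikit-learn sketch of how a ROC curve and its area can be computed from block-level suspiciousness scores; the labels and scores below are illustrative, not the paper’s data:

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

# y_true: 1 = erroneous block, 0 = error-free; y_score: suspiciousness scores
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_score = np.array([0.91, 0.20, 0.74, 0.65, 0.48, 0.10, 0.83, 0.55])

fpr, tpr, thresholds = roc_curve(y_true, y_score)   # TPR vs. FPR across thresholds
auc = roc_auc_score(y_true, y_score)
print(f"AUC = {auc:.3f}")
```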
5.1. Performance Analysis
The cumulative performance of the proposed model concerning all the classes is summarized and analyzed against the performance of existing studies for fault localization, as shown in
Table 12.
Table 12.
Performance analysis of various faults.
Approach | Sensitivity | Specificity | Accuracy |
---|---|---|---|
NB [25] | 0.356 | 0.922 | 0.833 |
ANN [25] | 0.186 | 0.957 | 0.858 |
DT [25] | 0.203 | 0.938 | 0.850 |
FLF [25] | 0.328 | 0.989 | 0.910 |
FEDRak | 0.934 | 0.934 | 0.937 |
As can be observed from the experimental analysis shown in the above table, the proposed fault forecasting model has shown reasonable performance. The accuracies of various algorithms across the NASA-MDP and PROMISE repositories are summarized in
Table 13.
Table 13.
Performance analysis of various faults.
Approach | Dataset | Accuracy |
---|---|---|
NB [25] | NASA-MDP | 0.833 |
ANN [25] | NASA-MDP | 0.858 |
DT [25] | NASA-MDP | 0.850 |
FLF [25] | NASA-MDP | 0.910 |
BN [37] | NASA-MDP | 0.713 |
DT [37] | NASA-MDP | 0.771 |
BN + SMOTE [37] | NASA-MDP | 0.786 |
DT + SMOTE [37] | NASA-MDP | 0.810 |
ANN [39] | PROMISE | 0.865 |
SVM [39] | PROMISE | 0.854 |
NB [39] | PROMISE | 0.850 |
TREE [39] | PROMISE | 0.836 |
KNN [39] | PROMISE | 0.612 |
SVM [40] | PROMISE | 0.772 |
MLP [40] | PROMISE | 0.788 |
RBF [40] | PROMISE | 0.795 |
VEL [40] | PROMISE | 0.844 |
NB-PCA [41] | PROMISE | 0.810 |
SVM-PCA [41] | PROMISE | 0.830 |
RF-PCA [41] | PROMISE | 0.830 |
RF-Adaboost [42] | PROMISE | 0.900 |
SVM-Adaboost [42] | PROMISE | 0.790 |
Adaboost-RF [43] | PROMISE | 0.897 |
Bag-RF [43] | PROMISE | 0.897 |
FEDRak | MDP, PROMISE | 0.952 |
As can be observed from
Table 12 and
Table 13, the proposed fine-tuned spectrum-based fault localization technique has outperformed that of the conventional fault-localization techniques. The average performances concerning various evaluation metrics of the local and global models are considered as the performance of the proposed model compared with other state-of-the-art techniques.
5.2. Threats to Validity
Empirical findings demonstrate that the approach posited in this investigation exhibits superior performance in inter-program defect forecasting. However, certain variables and plausible hazards impinge on the method’s validity. Acquiring extensive project datasets that contain defect labels can be a challenging task. Only a limited number of datasets from NASA-MDP, PROMISE, and Unix Utility Programs have been utilized to conduct comparison experiments. To enhance the validity and reliability of the divergent defect model, it is recommended that additional software defect datasets from multiple companies be utilized in future research endeavors.
The metrics employed as independent variables for forecasting software defects constitute a potential internal threat. Multiple datasets were used from various repositories, each with distinct metrics and granularity levels such as method, class, or file; using multiple datasets in the current study mitigates this potential risk. The potential for external threats exists when generalizing the conclusions derived from various client devices. The findings presented in this study are derived from datasets generated by multiple research teams, and discrepancies in the measurement techniques employed by these teams may impact the accuracy and reliability of the results. To address this potential risk, a deliberate selection process was used to identify datasets encompassing diverse implementation aspects, varying in scale and level of detail.
6. Conclusions
The software fault forecasting model presented in the current study has performed exceptionally well in identifying the code blocks with possible code errors. The model works over the fine-tuned spectrum-based fault localization technique, assessing the symmetry of normal and erroneous reference code features. Feature selection and scaling, followed by information extraction, are performed to forecast the faulty code blocks precisely. The feature weights are synchronized with the global model through the federated learning mechanism once the standalone local model is built. The performance of the standalone and global models for every program category is analyzed, and it is observed that the proposed approach outperforms the other contemporary approaches used in fault localization. This study is confined to a limited number of program classes in the NASA-MDP and PROMISE repositories, whereas the previously reviewed studies included all of the programs in those repositories; the current research may therefore be assessed further using all types of programs in the repositories. The performance of the proposed model can be further evaluated using divergent datasets for fault localization over multiple clients to evaluate the federated learning model’s performance precisely. The time delay is another crucial parameter in the federated learning mechanism that has to be assessed in future studies.
The future research directions also include security constraints like data confidentiality and integrity mechanisms for the data exchanged between the clients and the server in the federated learning environment. The optimization of weights is expected to result in improved performance. At the same time, including auxiliary memory elements for state information maintenance is anticipated to enhance the efficiency of the fault forecasting approaches.