1. Introduction
Water consists of more than two-thirds of the Earth’s surface; it is a crucial resource for living. However, even with its abundance, the usable amount of water is limited [
1]. water has an important position on the Earth. It is so typical in our daily lives that we usually ignore its essence. The water character is dynamic. It can change in both physical and chemical characteristics with time.
Water is a significant component of all living beings. All living cells are constructed of aquatic solutions, suspensions, and emulsions with water in the range of 25–85% [
2]. Youthful tissues hold more water than the ancient ones. In general, plant life also depends on water. They take water and other mineral salts from soils. Plants comprise 50–75% water, while in humans, water constitutes 60–65% in males and 50–60% in females [
3]. It is discreetly affected in operating systems and contributes to the synthesis and decomposition of several organic compounds. Natural water includes minerals, organic pollutants, and dissolved gases such as air and carbon dioxide [
4].
Furthermore, multiple ailments transfer through water. Therefore, real-time tracking of water quality (WQ) becomes an essential need [
5]. Water quality may vary due to biological and anthropogenic operations that happen in certain circumstances. Some of these operations are the cause of modifications in water quality due to increasing temperatures, biochemical oxygen consumption, chemical oxygen utilization, and increased nitrogen and phosphorus combinations [
6]. Recently, water pollution in the Republic of Kosovo has evolved into a fundamental problem that contains water’s chemical, physical and biological elements. Water availability and quality (surface or ground) have been corrupted because of essential elements such as growing population, automation, urbanization, and others. Surface water quality in current years in Kosovo is corrupt, primarily after release of urban and industrial wastewater and pollution from agriculture [
7].
Typically, water can be split into ground and surface water [
8]. These types of water are released as pollution risks from agricultural, industrial, and domestic activities and other sources. These sources may contain pollutants such as heavy metals, pesticides, fertilizers, hazardous chemicals, and oils [
9].
Water quality can be divided into four main types: potable water, palatable water, contaminated (polluted) water, and infected water [
10]. Below are the most typical scientific definitions of these types of water quality [
11]:
Potable water is safe to drink, pleasant to taste, and is used for domestic pursuits.
Palatable water can contain chemical components that do not generate a threat to human health. It is gorgeous and pleasing [
10].
Contaminated (polluted) water may include unwanted physical, chemical, biological, or radiological substances. It is inappropriate for drinking or domestic use.
Infected water is polluted with pathogenic organisms.
WQ parameters can be divided into physical, chemical, and biological [
12], identified in
Table 1.
Commonly, considering WQ entails gathering water samples from different sites at particular time intervals and analyzing them in laboratories. However, manual sampling and laboratory study of WQ for any given water body or process can be ineffective, costly, and time-consuming. Currently, intelligent systems are hugely used to observe WQ, especially when real-time data are required [
10,
13].
Machine learning (ML) is an artificial intelligence (AI) branch that allows applications to predict outcomes more accurately without being explicitly programmed. It is one of the most hopeful applications in information technology, whose application area is endless [
14,
15]. ML applied in the education area is presently admirable to researchers and scientists. Applications in the education sector include a student performance prediction [
16], students testing and grading fairly [
17,
18], and teachers and staff support [
19].
Recently, ML has been widely utilized in finance and marketing applications to solve the challenges in their respective fields. ML and especially decision support systems can enhance community performance by analyzing the ground reality. Due to competitors, costs, tax pressures, and other factors, a turbulent economy becomes typical for every organization. Privatization, globalization, and liberalization pull the organization into a more complex competitive environment. Therefore, to perform desired gain, organizations need appropriate marketing strategies. In addition, the marketing decision support system assists in decreasing organization burdens by studying and planning based on its efficient ML approach [
20].
ML is also used to predict the turnover of employees within an organization. The employee turnover rate can highly influence the organization’s performance. For this reason, the ability to expect employee turnover has recently been an invaluable tool for any organization seeking to retain employees and predict their future behavior [
21].
Because of the growth in Internet use during the past years, cyberattacks have grown, causing an increase in private information and financial loss. Cyberattacks contain phishing, spamming, and malware. Therefore, detecting and avoiding cyberattacks has become essential to increase security [
22]. ML is widely used in mail classification. More than 300 billion emails are sent daily. Almost half of them are spam. One of the main jobs of email providers is filtering out spam. Spam detection is confusing. The difference between spam and non-spam messages is unclear, and the criteria vary over time. From different actions to automate spam detection, machine learning has been demonstrated to be a practical and preferred approach by email providers [
23]. In addition, ML has been considered an essential technique for treating the enormous growth and complexity of cybersecurity threats. The ML technique system can identify patterns to catch malware and unusual activity better than humans and traditional software [
24,
25].
ML techniques have been used in innovative city development [
26]. They are integrated into many applications in smart cities, such as smart parking [
27], power generation [
28], healthcare [
29], and water and air pollution detection [
30]. Accurate forecasting of photovoltaic (PV) generation is critical to producing and planning PV-intensive power systems because of the inherent intermittency of solar power. Several machine learning-based PV forecasting methods have recently emerged [
31].
Moreover, ML is applied to air quality detection [
32]. Human survival would be impossible without air. Consistent developments in almost every aspect of modern human society have harmed the health of the air. Everyday industrial transportation and domestic activities introduce dangerous pollutants into our environment. Monitoring and forecasting air quality has become critical in this era, particularly in developing countries such as India. In contrast to traditional methods, machine learning-based prediction technologies have proven to be the most effective tools for studying such modern hazards.
Additionally, the healthcare sector has taken advantage of ML advancement to elaborate it in medicine and administrative activities [
33,
34]. ML technology permits the construction of models that can quickly interpret data and produce results. Thus, doctors could make good decisions on patient diagnosis and treatment options more efficiently. It may lead to enhanced patient health care services [
25]. Machine learning assists in making instructed clinical decisions by using past data and the knowledge of evidence-based medicine. ML furnishes techniques to explore and expose complex relationships that are difficult to convert into an equation. Healthcare communication is an essential task in the healthcare system. It is responsible for tactfully translating and sharing information to help and apprise patients and the public. ML is confirmed valid in healthcare with the capacity for complex dialogue control and windy flexibility [
35].
Machine learning (ML) has a recently vital role in enhancing the state of play in all professional sports, particularly football. It allows football management to predict the success of the matches via a detailed study of data, ML modeling, and much more. It assists in enhancing their state of play and building strategies that will produce promising [
36].
This paper evaluates different supervised machine learning algorithms on a public dataset of drinking water quality [
37]. The aim is to identify the highest performance- classifying model on the given dataset and the most reliable results to be used for further applications.
The paper is divided as follows:
Section 2 explores some recent ML applications in different sectors.
Section 3 elaborates the dataset used in this study with its corresponding features and the ML algorithms applied to the dataset. Afterward, obtained results are discussed in
Section 4 and are followed by a reasonable discussion about the performance of the ML algorithms. Finally,
Section 5 concludes the work strategy of this paper with a recommendation for future work.
2. Literature Review
ML has recently been used in many areas, including education, healthcare, power systems, security, air quality, and renewable systems [
38]. In education, ML has played an essential role, especially during the corona pandemic. ML techniques made the online examination during the lockdown period possible. In [
18], the authors systematically review the ML function in exam management systems during this period. This review was directed by assessing around 135 studies during the last five years. The importance of ML in the whole exam cycle from pre-exam practice, control of examination, and assessment were reviewed and examined. The unsupervised or supervised ML algorithms were determined and classified during each process. This review resumes all the problems and challenges of using machine learning in the examination system. These problems and challenges are discussed with their solutions. The vast development of information technologies and the enormous increase in computer usage in schools have produced innovations in test structure and examination. In [
17], the study aimed to automatically detect the most informative subset of test items in evaluating the examinees without decreasing accuracy. For this purpose, the authors proposed a new approach to employing abductive network modeling.
Phishing is an endeavor to rob private information or damage online accounts using misleading emails, messages, or sites that look familiar. Sometimes, a phishing email may contain links that allow for the download of malicious software on users’ computers when clicking on it. Such emails should be noticed by spam filters that generally categorize an email with a link as a phishing email. Regardless, emails that do not include links, called link-less emails, need more action from the spam filters. In [
39], a real-time anti-phishing system has been proposed. This system uses seven classification algorithms, and natural language processing (NLP)-based features. The system contains the following distinguishing properties from other studies in the literature: language independence, use of the enormous size of phishing and legitimate data, real-time execution, detection of new websites, independence from third-party services, and use of feature-rich classifiers. The authors constructed a new dataset to measure the performance of an ML detection algorithm. From the experiment, the authors conclude that the random forest algorithm with only NLP-based features gives the best accuracy with a 97.98% rate for detecting phishing URLs.
The authors in [
23] focused on categorizing link-less emails using the ML approach and deep neural networks. The proposed approach was tested using accurate data. Accordingly, the results indicated that the deep neural network functioned well in detecting phishing emails. In addition, ML has been used in detecting and predicting air quality. In [
40], the authors reviewed the studies on air pollution prediction using machine learning algorithms based on sensor data in the context of intelligent cities because of the increase in machine learning techniques and their entry into all fields, especially air pollution forecasting. After a comprehensive review of the most relevant papers regarding using the most famous databases and executing the corresponding filter, the main features were extracted to a link and were compared. As a result, they concluded that they applied advanced and sophisticated techniques instead of simple ML techniques. Moreover, the main prediction target was the particulate matter with a diameter of 2.5 micrometers. However, for efficient air quality prediction, it is important to consider external factors such as weather conditions, spatial characteristics, and temporal features.
In [
25], several machine learning algorithms have been reviewed for creating efficient decision support for healthcare applications. This review helped decrease the research gap for creating an efficient decision support system for medical applications. In [
33], the authors propose a classification method for Parkinson’s disease with the support of human voice signals. In this study, six different machine learning (ML) algorithms are used in the classification steps. The study sought to classify Parkinson’s disease based on human voice signals and pull essential elements only in the process to reduce the dataset complexity. Thus, voice signals are examined to check the human voice intensity and spectrum of Parkinson’s disease patients. Afterward, machine learning classifiers were applied to categorize them based on extracted features. The results indicate that the random forest classifier has the highest accuracy with a value of 97%, followed by the extreme gradient boosting, k-nearest neighbor, and decision tree classifiers with a value of 95%.
The authors of [
41] presented numerous flaws in healthcare’s current evidence-based approaches. However, they showed how insufficient biased evidence lead to ineffective care. They investigated the potential for data science and artificial intelligence in addressing healthcare ethical concerns. Finally, they offered policy recommendations for ML reform in the healthcare sector, which can aid in developing beneficial systems.
In [
29], a hybrid learning-based classifier was applied to an MRI dataset of amiable and abnormal images in the medical sector, in addition to deep learning algorithms. Compared to supervised ML approaches, this study proposed an approach that exceeded the existing approaches in the literature based on their experimental results.
ML is also used to achieve convenient management and treatment in desalination plants [
41]. In this study, the authors proposed an optimization system to ensure modular and cost-effective treatment for small industries. The system applies water vaporizing from saline liquid films that have been removed by surface evaporation based on the differences in vapor pressure produced by forced air convection. The optimization determines the optimal operating settings and allows for a comprehensive examination of the effect of diverse operational decisions on operating cost, capital cost, and footprint area.
In [
13], a new real-time technique to survey water quality was proposed. It uses electromagnetic sensors and ML techniques. The integrated multi-sensing device is used to measure several water quality parameters, including oxidation reduction potential, carbon dioxide gas, temperature, conductivity, and pH. A Vector Network Analyzer has been used to deliver various parameters such as S11 that operates at the 50 kHz–3 GHz frequency range. Changes in water samples were recorded and analyzed. Afterward, changes in water impurities were detected using ML techniques.
The authors in [
42] studied the performance of AI techniques including artificial neural network and support vector machine in predicting water quality components. All sampling data were gathered from an Iranian river. In applying the AI techniques, several types of transfer and kernel functions were tried. As a result, the authors concluded that the examined AI techniques are suitable for predicting water quality components.
The authors in [
43] examined an appropriate classification model for water quality based on ML techniques. Their study compared the performance of several classification models and algorithms to determine the significant features in classifying the water quality. All samples were collected from a Malaysian river. Five models with respective algorithms were applied and compared, showing that the lazy model using the K-Star algorithm has the best accuracy with an accuracy value of around 87%.
Two new techniques of ML based on the decision tree were proposed in [
44]. These techniques provided more accurate results on water quality prediction in the short-term period. The authors proposed the extreme gradient boosting and random forest techniques that provide an advanced data denoising technique. Water samples were taken from the Tualatin river in the United States. Six water quality indicators were annotated in the proposed models, including temperature, dissolved oxygen, and pH value.
In [
45], the authors have compared the water quality prediction performance of ten ML models (seven traditional and three ensemble models) using big data that were collected from Chinese rivers and recorded between 2012 and 2018. Four primary performance metrics were recorded: precision, recall, F1-score, and weighted F1-score to explore the potential key water parameters in the future. The obtained results show that extensive data enhance the ML models’ performance in predicting water quality. The dataset features include pH, DO and NH3-N.
In this study, a dataset with different characteristics is studied and examined to illustrate the water quality level. The following section is intended to explain the methodology of this paper.
4. Results and Discussion
After the supervised learning algorithms are applied to the dataset, the classification models’ results are computed using performance metrics. This section displays the confusion matrices for the ML algorithms studied in this paper as well as the performance measures to determine the best algorithms for differentiating potable/non-potable water.
The general form for a confusion matrix is represented in
Table 14 where “1” means that the water is potable and “0” means that the water is not potable.
The four notations represented in the matrix are read as follows:
tp is the number of records that are predicted to be potable and actually are potable.
fp is the number of records that are predicted to be not potable but actually are potable.
fn is the number of records that are predicted to be potable but actually are not potable.
tn is the number of records that are predicted to be not potable and actually are not potable.
After applying the dataset to each of the algorithms represented in
Section 3, the confusion matrices are produced and are represented in
Table 15,
Table 16,
Table 17,
Table 18,
Table 19,
Table 20,
Table 21,
Table 22,
Table 23,
Table 24 and
Table 25.
In this proposed approach, machine learning models are illustrated based on whether water quality is potable or not. The performance measurements evaluated in this study are F1-score and ROC AUC. F1-score is calculated in terms of Precision and Recall. The precision of an ML classifier represents the number of samples that are portable out of the total samples the model retrieved. It can be computed using the following expression:
However, the recall represents the total number of samples that the ML model correctly identified as portable out of the total portable samples. It is calculated using the following formula:
The F1-score can be computed based on Recall and precision values. It is a simple representation of the harmonic mean of Precision and Recall.
The ROC AUC—a metric that considers the capability of distinguishing classes—is computed in terms of Sensitivity and Recall. Recall is already defined in (2), whereas Sensitivity is computed as follows:
ROC is a plot of correct predictions (Recall) for the positive class versus the fraction of errors (Sensitivity) for the negative class. As a result,
Table 26 shows the metric values for Precision, Recall, F1-score, and ROC AUC.
In terms of precision, the KNN classifier is the better classifier with a value equal to 50.8%. It represents the ability of the ML classifier model to identify the portable data points. Therefore, among all the used classifiers, the KNN model is better in the identification of portable data points. KNN is adequate when there is a clear margin of separation between classes and when the noise in collected data is very low.
However, based on recall values, the LL and SGD models have a value of 100%. Thus, using these models, there are no records that are predicted to be potable, but in reality, they are not potable. Moreover, many used algorithms have a good performance according to recall values such as RF, KNN, ANN and LR.
F1-score is a critical evaluation metric in machine learning. It elegantly summarizes a model’s predictive performance by combining two otherwise opposing metrics—precision and recall. It considers precision and recall, which means it accounts for FPs and FNs. The higher the F1-score, the higher the precision and recall. According to
Table 26, SVM and ANN perform better than the other algorithms, with higher F1-score values of 63.8% and 63.9%, respectively. In addition, other models such as ExT, RF and KNN have an acceptable F1-score value. ANN and SVM are useful in the case of a large amount of datasets. In addition, they are efficient in detecting complex relationships of dependent/independent variables.
Similarly, AUC ROC measures a classifier’s capability in determining records to their corresponding classes. The greater the AUC, the better the performance in differentiating positive and negative classes. Based on
Table 26, SVM and ANN perform better than the other algorithms, with higher F1-score values of 63.8% and 63.9%, respectively. Moreover, they have higher ROC AUC accuracy values of 0.731 and 0.726, respectively. Thus, they are better than other models based on ROC AUC values.
5. Conclusions and Future Work
Water is linked to sixteen of the United Nations Sustainable Development Goals. Access to safe drinking water is essential for good health, a fundamental human right, and a component of effective health-protection policies. Clean water is a critical issue for both health and development. Investments in water supply and sanitation have been shown to produce a net economic advantage in some areas because they reduce adverse health effects and medical costs more than they cost to implement. However, a variety of pollutants are wreaking havoc on drinking water quality. In this paper, the efficiency of using ML algorithms in water pollution problems was studied. As a result, this paper considers many water characteristics for water quality prediction using machine learning algorithms. Environmental problem automation is critical for decision accuracy, long-term planning, and faster action.
This study compared various ML algorithm performances on a dataset of drinking water quality. Afterward, the results were compared to determine the best machine learning algorithm for water quality classification. Thus, using a real dataset, a machine learning classifier model was created in this research to predict the water quality. Significant features were chosen first. The dataset being used included all measurable characteristics. Subsets of data for training and testing were created. Applying a number of currently available ML algorithms, the outcomes were contrasted in terms of precision, recall, F1 score, and ROC curve. According to F1-score and ROC AUC values, the results demonstrate that the support vector machine and k-nearest neighbor are superior.
In future work, the proposed approach will be modified to enhance the performance of these algorithms. Hyperparameter tuning can be performed on each algorithm to find the best model setup to obtain the most optimized result.