1. Introduction
Froth flotation is a physicochemical process that separates economically valuable minerals from their gangue [1,2]. This separation occurs in organised cells in which the feed material (i.e., ore) is treated until the valuable minerals are sufficiently recovered. In most industrial operations, sensors measure key parameters of the flotation process, producing large volumes of data for analysis. Recent advances in machine learning (ML) offer opportunities to use flotation data effectively to design predictive and process control models for process optimisation. However, sensed flotation data are prone to quality issues, mainly outliers, which compromise the reliability of the data and the accuracy of models derived from them. To leverage valuable insights from flotation data analytics, high-quality data are critical so that ML models can learn meaningful relationships to effectively monitor control systems, improve performance, and optimise processes.
Enhancing data quality is necessary, as outliers can interfere with experimental analysis, leading to biased predictions, misleading insights, and reduced generalisation [3]. Outliers are not always bad observations: they can carry exceptional information, in which case further investigation may be needed to decide whether they should be retained in or removed from the dataset. As such, researchers scrutinise outliers to understand the factors that contributed to their generation or the unique circumstances that might have influenced their existence. This has motivated the application of outlier detection across several domains, including fraud detection [4], network intrusion [5], disease diagnosis [6], and fault detection [7]. Despite its acknowledged significance in diverse fields, outlier detection has not received adequate attention in mineral processing data analytics and remains a relatively under-explored topic. This limited focus can be attributed to (1) outliers often being perceived as errors to be discarded rather than interesting behaviours worth investigating, (2) the inherent complexity of the data, which makes it challenging to identify outliers accurately, and (3) the lack of domain-specific methods for identifying and interpreting outliers.
Outliers are observations that deviate from a body of data [8,9]. They can generally be classified into three main categories: point outliers, collective outliers, and contextual outliers. Point outliers are individual observations that deviate extremely from the overall distribution of a dataset [10]. Collective outliers are groups of observations that jointly deviate from the distribution of the dataset [11]. Contextual outliers are observations that are extreme only in a specific context [9,12]. For example, a temperature of 30 °C is normal in summer but is likely an outlier when recorded during winter. Within the mineral processing industry, factors such as faulty sensors, equipment malfunction, improper handling of missing data values, and unexpected fluctuations can produce any of these types of outliers in production data [13,14]. As such, outliers should be carefully investigated using appropriate methods to effectively monitor process equipment and the data they generate. More importantly, outliers should be properly managed before decisions are made based on analysis of the production data.
Flotation data represent dynamic relationships among key variables, including feed variables (feed mineralogy, particle size, throughput, liberation), hydrodynamic variables (bubble size, air flow rate, froth depth, impeller speed), and chemical variables (reagent dosages, pulp chemical conditions). The interdependence of these variables makes it arduous to justify an observation as an outlier within the intricate web of relationships it shares with other variables. For instance, a decrease in Eh values in a flotation pulp measurement may not be an outlier; instead, it may be attributable to an elevated iron sulphide content in the feed [15]. In addition, during comminution, changes in mineralogy and grinding media can cause significant changes in the pulp chemistry of the flotation feed [16,17]. Again, these changes may not be outliers. Furthermore, sensors used in harsh mineral processing environments may experience breakdowns or failures, yet continue to record data from the operation, leading to compromised and potentially inaccurate readings [18]. Such variable associations and equipment conditions complicate the distinction between process instabilities and outliers in flotation data. To enhance the quality of flotation data, methods for outlier detection should be critically explored while considering the intricate relationships among multiple variables.
Studies on outlier detection span several decades and can be broadly categorised into (1) statistical, (2) distance-based, (3) density-based, and (4) prediction-based techniques [19]. Statistical methods such as Grubbs' test [20], Doerffel's test [21], Dixon's test [3], Peirce's test [22], and Chauvenet's test [23] are well known and efficient in detecting point outliers, especially those occurring in univariate datasets. Other works [24,25,26] have reported robust statistical methods for assessing outliers.
In recent years, the boxplot [27] technique for outlier detection has gained popularity in engineering domains. The boxplot uses the concept of the interquartile range (IQR) to visualise outliers. The IQR is computed as $\mathrm{IQR} = Q_3 - Q_1$, where $Q_1$ is the first quartile and $Q_3$ is the third quartile, such that observations beyond the range $Q_1 - 1.5\,\mathrm{IQR}$ to $Q_3 + 1.5\,\mathrm{IQR}$ are considered potential outliers [28]. Other studies have used the minimum covariance determinant (MCD) and the minimum volume ellipsoid (MVE) to analyse multivariate data for outliers [12]. However, both MCD and MVE become ineffective as the data dimension increases. Although statistical methods are easy to implement, they are themselves sensitive to outliers, as their computation relies on the mean, median, and standard deviation of the data. In addition, they rest on an underlying assumption of normally distributed data, which often does not hold for real-world data. Furthermore, they are ineffective in detecting multivariate outliers, especially those occurring in high-dimensional datasets.
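For illustration, a minimal NumPy sketch of the boxplot (IQR) rule described above; the variable name and data values are invented for the example.

```python
import numpy as np

def iqr_outliers(x, k=1.5):
    """Boxplot rule: flag values outside [Q1 - k*IQR, Q3 + k*IQR],
    with the conventional multiplier k = 1.5."""
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return (x < lower) | (x > upper)

# Illustrative sensor readings with two extreme values
air_flow = np.array([5.1, 5.3, 5.2, 5.0, 9.8, 5.2, 5.1, 0.4])
print(iqr_outliers(air_flow))  # [False False False False  True False False  True]
```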
Alternatively, distance-based methods [29,30] offer solutions that mitigate the limitations of statistical methods. Distance-based methods use distance metrics, such as the Euclidean distance, to calculate the distance between observations and identify outliers based on these distance relationships. Knorr and Ng [29] proposed a classical distance-based outlier detection technique, defining a unified notion of outliers as follows: an object $O$ in a dataset $T$ is a $UO(p, D)$-outlier if at least a fraction $p$ of the objects in $T$ lie at a distance greater than or equal to $D$ from $O$ [29]. Ramaswamy et al. [31] improved this concept by computing the distance of each observation to its k-Nearest Neighbour (kNN) and treating as potential outliers those observations that fell beyond a specified neighbourhood. Distance-based methods have several drawbacks, including (1) the assumption that data are uniformly distributed, which may not hold for heterogeneous data with varying distributions, (2) algorithmic complexities that arise with high-dimensional datasets, and (3) ineffective detection of outliers within dense cluster regions.
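As an illustration of the kNN-distance idea of Ramaswamy et al. [31], the following sketch scores each observation by its distance to its k-th nearest neighbour using scikit-learn; the synthetic data and the choice of k are assumptions for the example, not the configuration used in this study.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def knn_distance_scores(X, k=5):
    """Outlier score: distance of each observation to its k-th nearest
    neighbour (larger = more outlying)."""
    # k + 1 because each point is returned as its own nearest neighbour
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    dists, _ = nn.kneighbors(X)
    return dists[:, -1]  # distance to the k-th neighbour, excluding self

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))        # synthetic 3-variable process data
X[0] = [8.0, 8.0, 8.0]               # one injected outlier
scores = knn_distance_scores(X, k=5)
print(np.argsort(scores)[::-1][:3])  # the injected point ranks first
```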
To overcome the shortfalls of distance-based methods, researchers have explored density-based outlier detection methods [32,33]. The most widely used density-based method is the Local Outlier Factor (LOF) [34]. It compares the local density of an observation to the densities of its neighbours; an observation is considered an outlier if it lies in a region of lower density than its neighbours. A score is computed to describe the degree of 'outlierness'. This score is used to identify exceptions in the dataset whose divergence is not easily detected, as well as those that exist in high-dimensional subspaces [35,36]. Recently, several variants of LOF have been explored, including Local Outlier Probability (LoOP) [37], Local Correlation Integral (LOCI) [38], Local Sparsity Coefficient (LSC) [39], and Local Distance-based Outlier Factor (LDOF) [40]. Although density-based methods can capture local outliers, they tend to be ineffective when low-density patterns occur in a given dataset [41,42].
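A minimal sketch of LOF using scikit-learn's LocalOutlierFactor on synthetic two-cluster data; the neighbourhood size and the injected point are illustrative assumptions.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (150, 2)),    # dense cluster
               rng.normal(6, 0.2, (50, 2)),   # second, tighter cluster
               [[3.0, 3.0]]])                 # local outlier between clusters

lof = LocalOutlierFactor(n_neighbors=20)
labels = lof.fit_predict(X)                   # -1 for outliers, 1 for inliers
scores = -lof.negative_outlier_factor_        # higher = more outlying
print(labels[-1], round(scores[-1], 2))       # the injected point gets a high LOF score
```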
The task of detecting and confirming outliers in flotation data is not straightforward, given the complexities associated with multiple variables as well as the diverse principles underlying the various detection methods. Each method is effective only where its detection principle applies, so different methods detect different outliers. As such, it is unclear which method to use and what threshold to set.
In this research, we propose an approach to outlier detection in flotation data that addresses two main challenges in complex industrial processes: (1) the presence of atypical data points that fall within the range of normal observations but represent anomalous process conditions; these points, while numerically similar to normal data, may indicate subtle deviations in the flotation process that are important to identify; and (2) the multidimensional nature of outliers in flotation data, where observations may appear normal when viewed from one perspective (or in one dimension) but exhibit anomalous behaviour when considered in the context of other variables. Our approach consists of four parts. First, a standard deviation factor of the outlier scores is used to determine which observations in the data are outliers. Second, we use a naive algorithm called trend differential to identify quasi-outliers, including observations that visually form sharp peaks in the input features. Third, we use different machine learning (ML) algorithms to identify outliers in the dataset from different perspectives. Fourth, we analyse the coverage of quasi-outliers by outliers from the ML algorithms to confirm valid outliers and to determine the effectiveness of the ML algorithms. The ML algorithms used in our work are k-Nearest Neighbour (kNN), Local Outlier Factor (LOF), and Isolation Forest (ISF).
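To make the first two parts concrete, the sketch below shows a standard-deviation-factor threshold on outlier scores and one plausible reading of the trend differential (flagging sharp peaks via first differences). Both functions are hypothetical illustrations under the stated assumptions; the exact algorithms are defined in Section 3.

```python
import numpy as np

def std_factor_outliers(scores, factor=2.0):
    """Flag observations whose score exceeds mean + factor * std.
    Assumes the 'standard deviation factor' is applied this way;
    the study's exact formulation is given in Section 3."""
    return scores > scores.mean() + factor * scores.std()

def trend_differential(x, factor=2.0):
    """Hypothetical reading of the 'trend differential': flag sharp peaks
    where the absolute first difference of a variable is unusually large.
    Illustrative only -- the actual algorithm is defined in Section 3."""
    d = np.abs(np.diff(x, prepend=x[0]))  # jump size between consecutive samples
    return std_factor_outliers(d, factor)

rng = np.random.default_rng(2)
x = 5.0 + 0.05 * rng.standard_normal(40)   # stable synthetic sensor trace
x[20] += 4.0                               # one sharp, short-lived peak
print(np.where(trend_differential(x))[0])  # [20 21]: the jumps in and out of the peak
```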
Our approach addresses two key questions: (1) should multiple methods be used to detect outliers, and (2) how should these methods and their results be compared?
The contributions of this study are as follows:
A standard deviation factor of two (2) is verified to be a suitable value for defining the outlier detection threshold.
A method called trend differential is proposed to systematically identify visual outliers called quasi-outliers. These outliers are important as a starting point for our outlier detection work.
An analysis of the coverage of outliers from different methods to examine the consistency of these methods (a sketch of such a coverage measure follows this list). Our results show that the outliers identified by the kNN algorithm cover most of the outliers identified by the other methods, making it the most effective.
An analysis of the effect of outliers on model building. The analysis shows that outliers can degrade the power of predictive models by increasing prediction errors.
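A minimal sketch of the coverage measure referenced above, expressed as simple set overlap between quasi-outlier indices and the indices flagged by a detection method; the exact computation used in Section 4 may differ.

```python
def coverage(quasi_idx, method_idx):
    """Fraction of quasi-outliers also flagged by a detection method
    (set overlap used here as an illustrative measure)."""
    quasi, method = set(quasi_idx), set(method_idx)
    return len(quasi & method) / len(quasi) if quasi else 0.0

quasi = [10, 42, 97]                 # indices flagged by trend differential (illustrative)
knn_flagged = [5, 10, 42, 97]        # indices flagged by kNN (illustrative)
print(coverage(quasi, knn_flagged))  # 1.0 -> all quasi-outliers covered
```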
The remainder of this paper is organised as follows. We present in Section 2 the collection and preprocessing of the sensed flotation data used in this study. In Section 3, we present the outlier detection methods. In Section 4, we present the results and findings of this work. Finally, we draw our conclusions in Section 5.
5. Conclusions
This study introduced a novel 'trend differential' approach combined with a standard deviation factor of two to identify quasi-outliers in industrial flotation data. The effectiveness of this method was then validated using established outlier detection algorithms (kNN, LOF, and ISF). While our approach successfully captured a majority of the most significant outliers in the dataset, it is important to critically examine the implications and limitations of these findings. The visualisation of quasi-outliers revealed significant trend breaks across multiple variables, suggesting that our method can detect complex, multivariate anomalies. This aligns with previous research by Hodge and Austin [54], who emphasised the importance of considering multiple dimensions in outlier detection for industrial processes. However, the precise nature of these trend breaks and their root causes in the flotation process warrant further investigation.
Our introduction of a 5 % control limit to capture rare observations proved effective in identifying outliers, but it is crucial to consider the potential trade-offs. As pointed out by Aggarwal [55], there is always a risk of misclassifying legitimate rare events as outliers, which could lead to a loss of valuable information for process optimisation. Future work should explore adaptive thresholding techniques that can adjust to varying process conditions, as suggested by Liu et al. [56]. The observation that outliers occur in diverse directions within the dataset underscores the complexity of flotation processes and the challenges of outlier detection. This multidimensional nature of outliers aligns with the findings of Markou and Singh [57], who highlighted the need for sophisticated, context-aware outlier detection methods in complex industrial settings.
Our evaluation of model prediction performance with and without outliers demonstrated their significant impact on prediction accuracy. While this supports the importance of outlier detection and removal for accurate modelling, it also raises questions about the potential loss of important process information. As cautioned by Rousseeuw and Hubert [58], indiscriminate removal of outliers can lead to model overfitting and reduced generalisability.
The limitation of this study to three outlier detection algorithms, while providing valuable insights, also highlights the need for a more comprehensive comparison of methods. Future work could explore the application of deep learning techniques, such as autoencoders [59], which have shown promise in handling the high-dimensional data typical of industrial processes. Moreover, the potential for real-time outlier detection and its integration into process control systems remains an exciting avenue for future research. As suggested by Ge et al. [60], the development of adaptive, online outlier detection methods could significantly enhance process monitoring and control in mineral processing operations. While our 'trend differential' approach shows promise in identifying complex outliers in flotation data, its practical implementation requires careful consideration of process-specific factors and potential information loss. Future research should focus on developing more adaptive, context-aware outlier detection methods and exploring their integration with robust modelling techniques to enhance both the accuracy and interpretability of flotation process models.
The following conclusions can be drawn from this study:
The outlier detection algorithms are effective in enhancing data quality, and their performance was assessed. The kNN algorithm performed best compared to LOF and ISF in terms of the number of quasi-outliers detected and covered, as kNN ranks the majority of the worst outliers as top outliers. The detection algorithms can therefore be ranked by effectiveness with kNN first.
Training data containing outliers can cause predictive models to make larger errors on non-outlier input records. The study showed that outliers have detrimental effects on prediction performance compared to ‘normal’ observations. This negative impact of outliers should not be overlooked as they produce inaccurate performance outcomes, especially in high-dimensional data.
The dynamic nature of flotation processes makes distinguishing ‘normal’ observations from outliers complex. Analysts should avoid rigidly applying predetermined thresholds for outlier detection without thorough investigations and consultation with industry experts. It is essential to assess the degree of outlier behaviour in flotation data using both analytical methods and domain knowledge to enhance data quality.
This research is highly significant to both the research community and the mineral processing industry. It demonstrates that unsupervised ML algorithms are effective in analysing data from flotation operations. These algorithms can detect outliers, enhance data quality for predictive analysis, and improve process optimisation for future planning and decision making.