1. Introduction
Ensuring an efficient and smooth flow of production processes can be challenging, time-consuming, and, at times, problematic. In the wood industry, for example, among the many tasks that need to be monitored, some require specialized knowledge and precision, others consume a significant amount of time, and quite a few combine all of those features. One such task is evaluating the state of drills in the manufacturing process, which belongs to a class of problems widely known as tool condition monitoring. When performed manually, this task usually requires stopping the production process so that individual tools can be evaluated. A human expert is then needed to check the used elements, without any prior indication of their actual state. As a result, unnecessary downtime may occur that could have been avoided if the entire process had been, at least partially, automated.
When it comes to tool evaluation, many different approaches have been considered to either speed up the process or avoid human intervention altogether. For example, the main focus may be on evaluating the state of elements without interrupting the actual manufacturing process, as presented in [1]. The most basic and commonly used approaches, such as the one presented in [2], measure different signals, such as vibration, noise, acoustic emission, cutting torque, feed force, and others, in order to evaluate the tool state. Similar approaches were used in [3], where data were extracted from both the signal and frequency domains, along with wavelet coefficients, in order to evaluate the obtained elements automatically, checking how relevant each item was to the selected problem. Further, in [4], the authors used different signals over a wide range of cutting conditions, using a back-propagation neural network to predict the flank wear of a drill bit. Another approach that relies heavily on different sensors is presented in [5]. In this case, various sensors were used to collect data, which were later integrated using a fuzzy logic model.
While sensor-based approaches are widely used, they are not always the best solution. First of all, the equipment required to even start taking such measurements tends to be quite expensive and requires lengthy calibration, which, if conducted incorrectly, can affect the resulting accuracy. Furthermore, a setup containing multiple sensors might be difficult to integrate into an actual work environment without affecting the production process. Given those problems, as well as additional requirements that might appear in different industries regarding the desired accuracy or other properties of the final solution, such as limiting the number of critical errors (mistakes between the border classes of tool wear), a simpler input might be required.
In some previous works, images were used as a base for drill state evaluation, using various machine learning algorithms. Such solutions were considered in [6,7]. In those cases, the specialized sensors were dropped entirely, and instead of signals, images of drilled holes were used for evaluation purposes, requiring only a simple camera to obtain them. The presented solutions are based mainly on convolutional neural networks (CNNs), which have the additional advantage of not requiring any specialized diagnostic features and are also considered top solutions for image recognition, as mentioned in [8]. The first of the two approaches uses data augmentation to expand the dataset without the need for additional samples, combined with a transfer learning methodology. An accuracy of 93% was achieved, without the need for a complicated sensor setup. In the second solution, a classifier ensemble was used to further increase the overall classification rate, exceeding a 95% accuracy rate. There are also some recent approaches that incorporate similar methodologies. In [9], various CNN networks were tested and evaluated to prepare an improved approach focused more on limiting the critical errors that the classifier makes. In another solution, presented in [10], a Siamese network, a newer CNN-based methodology, was applied to the same problem. In both approaches, the window parameter, which groups consecutive images of holes drilled in sequence, was used to further improve the achieved results. Finally, in [11], a more time-efficient approach was presented, this time using image color distribution, with the assumption that, after converting an image to gray-scale, there will be more pixels with mid-range values in images representing holes drilled by more worn tools. All those solutions achieve high accuracy and a relatively low number of critical errors.
What can be noted is that while most of the presented solutions take the manufacturer requirements into account, they also have some drawbacks. First, on more difficult examples, the solutions tend to make different errors in the final classification (some more severe than others). Second, the manufacturer cannot easily switch between different metrics used to evaluate the final solution. Those drawbacks led to the current approach, in which images are still used as input data, but instead of a hard classification model, which assigns a class to each presented example, a more elastic approach is incorporated. Instead of classifying each sample as belonging to one of the classes (green for a tool that is new, red for a tool that should be discarded, and yellow for one that requires further evaluation), a confidence metric is incorporated to inform the user how exact the current classification is. Samples can then either be classified further or discarded and assigned to a human expert for that purpose. Furthermore, the solution can be adjusted to focus either on accuracy or on dataset coverage; since different industries might have varying requirements in those aspects, such an approach provides the user with more control over how the presented solution works. It also allows for easier adaptation to chosen problems.
The novelty of this work is that it provides a robust way to quantify the uncertainty of any multi-class classification as a confidence parameter, which allows us to discard some observations with low confidence in order to increase the performance metrics for the remaining observations. This approach makes it easy to combine human expert knowledge with algorithmic classification and can be added on top of any multi-class classifier.
2. Methodology
In previous work (see [11]), it was noted that, for a set of images converted to gray-scale, the number of pixels in the mid ranges can be used to evaluate the state of the drill that was used to prepare the hole shown in each image. The research presented in this article builds on this assumption, but instead of hard classification, where each image is assigned a single class, the samples are evaluated in terms of decision confidence.
The presented confidence evaluation process consists of a few steps, involving initial data processing, model preparation, and the confidence function itself.
2.1. Dataset
The dataset used in the current experiments contained a total of 8526 images showing holes drilled by steadily declining tools. For the initial class evaluations, the resulting sample set was manually labeled. In the case of the presented dataset, external corner wear, W (mm), was used as the decisive factor for assigning a class to each image. This parameter was measured using a workshop microscope, a Mitutoyo TM-505 (Mitutoyo, Kawasaki, Japan). According to experts in this field, the parameter ranges were established as follows (a short code sketch of this labeling rule follows the list):
W < 0.2 mm—drill classified as green;
0.2 mm < W < 0.35 mm—drill classified as yellow;
0.35 mm < W—drill classified as red.
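As an illustration, this labeling rule can be expressed as a short function. A minimal sketch, with a hypothetical function name, assuming W is given in millimeters and boundary values fall into the more worn class:

```python
def label_from_wear(w_mm: float) -> str:
    """Assign a tool-state class from external corner wear W (mm)."""
    if w_mm < 0.2:
        return "green"    # tool in good condition
    if w_mm < 0.35:
        return "yellow"   # tool requires further evaluation
    return "red"          # tool should be discarded
```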
Of those images, 3780 were classified as the green class, 2800 as the yellow class, and 1946 as the red class. Images were chosen as input data since, while showing the declining state of the drill (edges tend to be more jagged for more worn tools than for new ones), they do not require a significant amount of time to obtain, and the acquisition process itself can be adjusted to the specific needs of each manufacturer. All samples used in the current research were obtained in cooperation with the Institute of Wood Sciences and Furniture at Warsaw University of Life Sciences. A summary of the dataset is presented in Table 1 below.
Holes were drilled with a standard CNC vertical machining center, a Busellato Jet 100 (Busellato, Thiene, Italy). To keep the entire setup as close to potential manufacturer requirements as possible, the drilled material was one typically used in the furniture industry: laminated chipboard U511SM from the Swiss Krono Group (Swiss Krono Sp. z o.o., Żary, Poland). The drill used was a 12 mm Faba WP01 (Faba SA, Baboszewo, Poland) with a tungsten carbide tip. The initial test piece dimensions were 2500 mm × 300 mm × 18 mm. To acquire the actual images, the test pieces were later divided into smaller ones, which were photographed separately. Example fragments representing each of the recognized classes are presented in Figure 1. The final images, representing one hole each, were obtained using a custom script, which extracted the desired area and saved it as separate images with three RGB color channels. The images were stored in the exact order in which they were taken, facilitating easier evaluation of the obtained results, but the time series structure of the data was not incorporated into the current solution. Example input images used by the presented procedure are shown in Figure 2.
All the images in the current dataset were manually labeled as one of the recognized classes by a human expert and later used by the prepared model.
2.2. Model
Images converted to gray-scale using the ITU-R 601-2 luma transform were the input data for the following algorithm steps. During model preparation, an initial classification based on the overall grouping of pixels was conducted. The initial research presented in [11] showed that, although images of holes of degrading quality show a steady increase in gray pixels (the pixels in each image were divided into three groups for that process: black for the hole, white for the laminated chipboard surface, and gray for the hole edge), there is no clear border between the classes; hence, the images cannot be easily classified using only that count. At the same time, the general relation between the number of gray pixels and the quality of the drilled hole still holds, with images of holes belonging to the green class having significantly fewer gray pixels than those belonging to the yellow or red classes.
During this initial step, the image preparation was conducted. The original images were represented in RGB and varied in size (the custom script prepared for this phase focused on cutting out the fragment of the image containing the hole, with its edges, with as little detail from the surrounding sample as possible, but including any jagged parts of the hole edge). Due to that variation, in the first step, images were resized to 256 × 256 pixels to ensure a uniform size. Additionally, since the information regarding the state of the hole edge does not require color values, the images were converted to gray-scale.
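As a sketch of this preparation step (with a hypothetical helper name): Pillow's "L" mode conversion implements the ITU-R 601-2 luma transform mentioned above, so both operations take only a few lines:

```python
from PIL import Image

def prepare_image(path: str) -> Image.Image:
    """Resize a hole image to 256x256 pixels and convert it to gray-scale.

    Pillow's "L" mode applies the ITU-R 601-2 luma transform:
    L = 0.299 * R + 0.587 * G + 0.114 * B.
    """
    return Image.open(path).resize((256, 256)).convert("L")
```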
The next step involved counting the occurrences of each pixel value and normalizing them to fit the 0–1 range, accomplished by simply dividing each pixel count by the total number of pixels in the image. From those counts, an array containing 256 values was prepared, which was later used as the input to the next part of the model. This element used the Light Gradient Boosting Machine (also LGBM or Light GBM, described in [12]). LGBM uses tree-based learning that grows trees leaf-wise (vertically), splitting the leaf with the maximal delta loss, and can handle larger datasets while using less memory. This approach also focuses on accuracy and uses efficiency as one of the main quality indicators. During the experiments, 15 rounds of Bayesian optimization were used to choose and optimize hyperparameters with a multi-class log-loss metric. The data obtained from this element were later used in the following steps of the presented procedure.
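A minimal sketch of the histogram feature extraction and classifier setup, assuming LightGBM's scikit-learn interface; default hyperparameters stand in for the values found by Bayesian optimization:

```python
import numpy as np
from lightgbm import LGBMClassifier

def histogram_features(gray_img) -> np.ndarray:
    """Normalized 256-bin histogram: count of each pixel value / pixel count."""
    pixels = np.asarray(gray_img, dtype=np.uint8).ravel()
    return np.bincount(pixels, minlength=256) / pixels.size

# X: (n_samples, 256) array of histograms, y: integer class labels (0, 1, 2).
model = LGBMClassifier(objective="multiclass")
# model.fit(X, y)  # after tuning hyperparameters with Bayesian optimization
```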
In order to obtain probabilities with a window size of 1 (meaning that only a single image is taken into account, rather than a sequence of images), the 5-fold cross-validation method was used. The presented method can also easily be expanded to include larger windows. The baseline accuracy for the chosen set of parameters was 0.67 (this result was obtained in previous work [11], where the approach without feature selection and with a window equal to 1 achieved exactly the same value). Given the probability distributions obtained through 5-fold cross-validation, we calculate the confidences for each of the 4 different confidence functions defined in Section 3. For each of those 4 results, we calculate different metrics of how well they achieve the goal of measuring confidence, which are also defined in Section 3. We compare the results of those metrics in Section 4.
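Such out-of-fold probabilities can be obtained, for instance, with scikit-learn's cross_val_predict; a sketch assuming X and y hold the histogram features and labels prepared above:

```python
from sklearn.model_selection import cross_val_predict

# Each row of proba is the out-of-fold class probability vector for one image,
# produced by the fold in which that image was not used for training.
proba = cross_val_predict(model, X, y, cv=5, method="predict_proba")
```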
2.3. Problem Formulation
The main metrics considered with the current classification model were accuracy and the number of severe (red–green) errors. In order to improve both, several approaches can be adopted, including using a set of subsequent images instead of single ones, as was conducted in the authors' previous work [9] using the window parameter. The dominant value of all classifications in the window can then be used. This method, although it increases the classification rate in general, is only suitable for some problems, especially since it complicates deploying the classifier into the manufacturing process (the industry needs to be able to produce subsequent images during production).
In the current approach, instead of hard classification, where an image is assigned to one of the recognized classes, an additional class or state was added, making it possible to return an "I don't know" or "undefined" pseudo-class, where some observations will not be classified at all. Later on, the samples labeled as "undefined" can be further examined by human experts. While this approach is not fully automatic, it can eliminate the majority of observations with a clear classification, leaving only the harder and more interesting examples for manual evaluation, possibly resulting in better performance of the entire solution. The folds are the same as in Table 1; therefore, this process is based on 5-fold cross-validation. The overall structure of the presented solution is shown in Listing 1, where classification and confidence calculation are presented, and in Listing 2, where the different confidence metrics are calculated.
Listing 1. Classifying observations and calculating confidence.
Listing 2. Calculating confidence metrics.
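The bodies of the listings are not reproduced here; a minimal sketch of the thresholding logic described by Listing 1, with hypothetical names, could look as follows:

```python
import numpy as np

def classify_with_confidence(proba, confidence_fn, threshold):
    """Label each sample; low-confidence samples become 'undefined'.

    proba: (n_samples, n_classes) array of probability vectors.
    confidence_fn: maps one probability vector to a value in [0.0, 1.0].
    """
    conf = np.apply_along_axis(confidence_fn, 1, proba)
    labels = proba.argmax(axis=1).astype(object)
    labels[conf < threshold] = "undefined"  # deferred to a human expert
    return labels, conf
```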
3. Confidence Function
The approach presented in this paper is based on a confidence function, which describes how sure the model is of a given classification. Results with low confidence can be discarded, leaving only those with higher confidence values. It is expected that such filtered samples will yield better results than the entire dataset, including unsure classifications, would. In an ideal situation, dropping, for example, the 2% least confident results would boost the actual accuracy by around 10%.
3.1. Confidence Function Constraints
In order to use some methods as confidence functions, a set of constraints needs to be defined first. Assume that the result of a multi-class classification problem with n ≥ 3 classes is given as a probability vector, with probabilities obtained using the softmax function (which is common practice in neural networks and other models [13]).
To achieve comparable results for different confidence functions, the function should be able to transform an n-element vector of probabilities into a single value in the range [0.0, 1.0].
The confidence function $C$ should satisfy three constraints:

$C(\bar{p}) = 0$, (1)

$C(\hat{p}) = 1$, (2)

$0 \le C(p) \le 1$ for all $p \in P_n$, (3)

where $\bar{p} = (1/n, \ldots, 1/n)$ is the probability vector containing equal probabilities, $\hat{p}$ is a probability vector with all but one element equal to 0 (a unitary vector), and $P_n$ is the set of all probability vectors with length n.
Given some confidence threshold t, all observations that have a confidence score lower than t are categorized with the “undefined” pseudo-class. The coverage c is the fraction of all observations for some threshold t that are still normally classified, and we will denote the accuracy of that classification as a. Increasing the threshold should increase accuracy, but it will decrease coverage.
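Both quantities follow directly from the confidence scores; a small sketch with hypothetical names:

```python
import numpy as np

def coverage_and_accuracy(y_true, y_pred, conf, t):
    """Coverage c and accuracy a of the samples kept at confidence threshold t."""
    kept = conf >= t
    c = kept.mean()  # fraction of observations still normally classified
    a = (y_true[kept] == y_pred[kept]).mean() if kept.any() else float("nan")
    return c, a
```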
3.2. Confidence Function Candidates
While the constraints and some general assumptions for the presented model have been defined, candidates for the confidence function still need to be selected. For the initial approach, three families of methods were considered: Shannon-, Gini-, and norm-based confidence, which are outlined below.
3.2.1. Shannon-Based Confidence
Shannon entropy is a good candidate for confidence, as it is used as an inequality measure [14]. Shannon-based confidence can be defined as presented in Equation (4):

$C_{Shannon}(p) = 1 - H_n(p) = 1 + \sum_{i=1}^{n} p_i \log_n p_i$, (4)

where $p$ is an n-dimensional probability vector, and $H_n$ is the entropy with a logarithm base equal to n. It satisfies all constraints because the maximal possible entropy, achieved with $\bar{p}$, is 1, and for $\hat{p}$, it is 0.
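A sketch of this function, computing the base-n entropy from Equation (4) directly:

```python
import numpy as np

def shannon_confidence(p) -> float:
    """1 minus the Shannon entropy of p with logarithm base n (Equation (4))."""
    p = np.asarray(p, dtype=float)
    nz = p[p > 0]                                        # by convention 0*log(0) = 0
    entropy = -(nz * np.log(nz)).sum() / np.log(p.size)  # base-n entropy
    return 1.0 - entropy
```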
3.2.2. Gini-Based Confidence
Another measure of inequality [15] is the Gini coefficient. In order to satisfy the chosen constraints, it should be normalized by the Gini coefficient of the unitary vector $\hat{p}$. Gini confidence can then be defined as shown in Equation (5):

$C_{Gini}(p) = \frac{G(p)}{G(\hat{p})}$, (5)

where $G$ denotes the Gini coefficient. As $G(\bar{p})$ is 0, and maximal inequality is achieved by $\hat{p}$, this confidence measure also satisfies all constraints.
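A sketch of this function, assuming the usual mean-absolute-difference form of the Gini coefficient; for a probability vector the mean is 1/n, and the Gini coefficient of a unitary vector works out to (n − 1)/n:

```python
import numpy as np

def gini_confidence(p) -> float:
    """Gini coefficient of p normalized by that of a unitary vector (Equation (5))."""
    p = np.asarray(p, dtype=float)
    n = p.size
    gini = np.abs(p[:, None] - p[None, :]).sum() / (2 * n)  # mean of p is 1/n
    return gini / ((n - 1) / n)  # Gini of (1, 0, ..., 0) is (n - 1) / n
```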
3.2.3. Norm-Based Confidence
To use a slightly different approach, the inequality of a given prediction can also be measured as the distance to the closest unitary vector. In order to satisfy the constraints, the distance needs to be adjusted by the maximal possible distance, which is the distance from $\bar{p}$ to $\hat{p}$. As the distance metric, the standard $L_1$, $L_2$, or $L_\infty$ norms can be used, as shown in Equation (6):

$C_{L_k}(p) = 1 - \frac{\min_{\hat{p}} \| p - \hat{p} \|_k}{\| \bar{p} - \hat{p} \|_k}$, (6)

The maximum norm $L_\infty$ serves as the baseline comparison for all confidence functions (see Equation (7)). It is the simplest of the presented functions: it just measures the distance between the maximum probability and 1.0, standardized to the range [0.0, 1.0] so that it fulfills the constraint set:

$C_{L_\infty}(p) = 1 - \frac{1 - \max_i p_i}{1 - \frac{1}{n}}$, (7)

It is worth noting that, with that scaling, $L_1$-based confidence gives the same results as $L_\infty$-based confidence; therefore, a baseline value can be obtained using either of those functions.
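A sketch of the baseline variant from Equation (7); as noted above, the L1 version yields the same values:

```python
import numpy as np

def linf_confidence(p) -> float:
    """Scaled distance between the maximum probability and 1.0 (Equation (7))."""
    p = np.asarray(p, dtype=float)
    return 1.0 - (1.0 - p.max()) / (1.0 - 1.0 / p.size)
```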
3.3. Comparing Confidence Functions
To select the best confidence function, that is, one that maximizes the accuracy a and coverage c across all thresholds t, a direct comparison of the chosen confidence functions on the given model outputs is required. One additional factor that needs to be included is the actual gain a confidence function provides for the current approach. This is measured by comparing the accuracy at a given confidence threshold with that of the default approach, which does not use confidence at all (the default accuracy of the presented model).
3.3.1. Accuracy Threshold
Since the presented approach aims to be as versatile as possible, it is worth noting that, depending on the specific application, the accuracy constraints might differ, requiring the solution to achieve specific values in that aspect. For this setting, the best confidence function is the one that, for the chosen accuracy threshold a, achieves a confidence threshold t that ensures the best coverage c.
In the presented case, the default accuracy of the used classifier is 0.67, and the goal accuracy is 0.80. The threshold t for Shannon-based confidence is 0.33, which corresponds to an accuracy a of 0.80 and a coverage c of 0.55. For the baseline $L_\infty$-based confidence, the threshold t is 0.41, with a coverage c of 0.46. Therefore, for that problem, Shannon-based confidence is a better confidence function than $L_\infty$-based confidence.
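In code, this selection reduces to scanning candidate thresholds; a sketch with hypothetical names (since accuracy need not grow strictly monotonically with t, the scan simply returns the first threshold reaching the goal):

```python
import numpy as np

def threshold_for_accuracy(y_true, y_pred, conf, goal_acc):
    """Smallest threshold t whose kept-sample accuracy reaches goal_acc."""
    correct = (y_true == y_pred)
    for t in np.linspace(0.0, 1.0, 101):  # raising t trades coverage for accuracy
        kept = conf >= t
        if kept.any() and correct[kept].mean() >= goal_acc:
            return t, kept.mean()         # the threshold and its coverage c
    return None                           # goal accuracy not reachable
```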
3.3.2. Coverage Threshold
Another approach to this problem would consider, instead, that the confidence function should maximize the accuracy a for observations above the confidence threshold t corresponding to a given coverage threshold c. In general, this would correspond to eliminating the most problematic fraction of cases from the set, and maximizing the accuracy on the rest of the observations.
For example, in the presented model, let us assume a 0.9 coverage threshold as our goal. With Gini-based confidence, the corresponding confidence threshold t would be 0.42, with an accuracy a equal to 0.69. With the same requirements, for the baseline $L_\infty$-based confidence, the confidence threshold t would be 0.02, with an accuracy a of 0.67.
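For a coverage goal, the matching confidence threshold is simply a quantile of the confidence scores; a short sketch:

```python
import numpy as np

def threshold_for_coverage(conf, goal_cov):
    """Largest threshold t that keeps at least a goal_cov fraction of samples."""
    # Keeping the top goal_cov fraction by confidence corresponds to the
    # (1 - goal_cov) quantile of the confidence scores.
    return float(np.quantile(conf, 1.0 - goal_cov))
```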
3.3.3. Weighted Accuracy Gain
Weighted accuracy gain measures the coverage-weighted sum of the differences between the accuracies at the considered confidence thresholds and the default classifier accuracy, as shown in Equation (8):

$WAG = \frac{1}{n} \sum_{t} c_t (a_t - a_0)$, (8)

where $a_t$ is the accuracy with a confidence threshold t, $a_0$ is the accuracy with a confidence threshold of 0, which corresponds to the baseline classifier accuracy, $c_t$ is the coverage with a confidence threshold t, and n is the number of thresholds considered.
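A sketch of this metric, under the reconstruction of Equation (8) given above and with hypothetical names:

```python
import numpy as np

def weighted_accuracy_gain(y_true, y_pred, conf, thresholds):
    """Coverage-weighted mean accuracy gain over the baseline (Equation (8))."""
    correct = (y_true == y_pred)
    a0 = correct.mean()                   # accuracy at threshold 0 (baseline)
    gains = []
    for t in thresholds:
        kept = conf >= t
        if kept.any():
            gains.append(kept.mean() * (correct[kept].mean() - a0))
    return float(np.mean(gains))
```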
3.3.4. Confidence Area under Curve
In the case when no accuracy threshold is given, several confidence thresholds can be checked instead, and the function maximizing both accuracy and coverage should be chosen. This problem is similar in formulation to the receiver operating characteristic area under the curve, roc_auc [16], and we will be using the auc shortcut in the following sections.
The area under the confidence curve (as shown in Figure 3) can be calculated using the trapezoid rule. The points used were constructed as pairs (a, c) corresponding, respectively, to the accuracy and coverage for each of the confidence thresholds t, from 0 to 1.0. It was assumed that the accuracy at a confidence of 1.0 is also one of the thresholds, with the lowest coverage. The baseline here is again the accuracy of the initial model, which does not include any type of confidence.
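A sketch of this calculation, pairing accuracy with coverage at each threshold and integrating with numpy's trapezoid rule:

```python
import numpy as np

def confidence_auc(y_true, y_pred, conf, n_thresholds=101):
    """Area under the (coverage, accuracy) curve via the trapezoid rule."""
    correct = (y_true == y_pred)
    pairs = []
    for t in np.linspace(0.0, 1.0, n_thresholds):
        kept = conf >= t
        if kept.any():                    # skip thresholds with zero coverage
            pairs.append((kept.mean(), correct[kept].mean()))
    cov, acc = zip(*sorted(pairs))        # integrate accuracy over coverage
    return float(np.trapz(acc, cov))
```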
5. Conclusions
In this article, a novel and adaptable classification algorithm was described. While the presented solution was mainly applied to drill wear classification, it is not limited to this task. Instead of focusing on hard classification, that is, assigning a single class to each example, the method focuses on evaluating the confidence of each considered sample. While there is no clear winner among the evaluated confidence functions, for each accuracy or coverage threshold, at least one function performed at least acceptably. In some cases, the differences between the functions were negligible, while their performance was still satisfactory.
Even in its current state, the presented solution is quite versatile and can be easily adapted to any number of recognized classes. Furthermore, due to the confidence metrics applied, the model can be better evaluated in that aspect, indicating how sure the classifier is when assigning a certain class to each sample. As an additional feature, depending on the manufacturer requirements, the method can focus more on obtaining a required accuracy rate or on covering a chosen fraction of samples. Finally, by discarding some of the examples, labeling them as too complicated for automatic classification (for the cases when the metric shows a confidence below the assigned threshold), and having them evaluated by human experts, the accuracy of the entire solution should improve, avoiding the severe errors that tend to be a problem with fully automatic solutions. All of the above features expand the available applications of the prepared algorithm. Combined with the possible focus on either accuracy or dataset coverage, this overall functionality makes the presented solution well suited for various classification tasks that may appear in the wood and other industries.