1. Introduction
Reliable production monitoring data collected over the long term is of crucial importance in many industries because it provides an informed background for resource allocation and saving, optimization, and operational improvement [1]. Productivity gains in different types of operations are nowadays seen as one of the founding factors of enhanced competitiveness, and monitoring data is commonly obtained through different types of surveys, ranging from less advanced, generalist ones to those able to produce accurate and detailed quantitative data in real time. The wood processing industry is no exception, yet it is frequently found to have a limited capability to achieve efficient production, which may be the effect of low technical and allocative efficiencies [2], as well as of missing monitoring data, with the latter preventing science from finding solutions to the problem. The situation is even more constrained in small-scale sawmills, which rely on simple machines that do not integrate production monitoring systems, operate at low production rates [3,4,5], and lack the financial means to procure sophisticated monitoring systems. At least in such cases, production monitoring solutions are few and are limited by the amount of resources needed under regular or advanced approaches to the problem.
In this regard, long-term assessment of efficiency and productivity requires data on production (i.e., the amount of manufactured products) and time consumption (i.e., the time spent to manufacture them) [6]. Moreover, to build a clearer picture of the factors that should be (re)engineered for better performance, productivity studies are often carried out at the elemental level [7]. Of particular importance in production monitoring is also the ability to identify and delimit different kinds of delays, which provides the computational basis for net and gross productive performance metrics [7]; with respect to delay-free time, studies are often framed around the main functions that a machine or tool enables, with these functions also being interpreted in a spatial context.
While production data is often easy to obtain because it forms the basis of market transactions, monitoring of productive performance relies on time-and-motion studies that can be carried out at different resolutions and by different means [7]. Regarding the means, and assuming an absence of integrated monitoring systems, the current options include traditional chronometry studies [3,8], video surveillance, and the use of other kinds of external sensors [9]. For long-term monitoring, however, video surveillance and the use of different types of sensors have the most promising potential due to their ability to collect and store events of interest over long time windows. In particular, video surveillance holds the important capability of capturing and storing the real sequence of events [1]; however, the office effort needed to analyze the data by human-assisted interpretation may be challenging, especially when the observed processes or study designs are complex [10].
To overcome this situation, solutions are needed to learn and classify the events of interest and their associated time consumption directly from the video files. Such an option could be enabled by a series of properties and methods associated with video surveillance, signal processing, and artificial intelligence. One of them refers to producing a useful signal from the collected video files, which is enabled by recent developments in algorithms [11] and software for video tracking applications [12]; the latter are based on the supervised definition of an object in a given frame, followed by its detection and tracking in the subsequent frames [13]. In this way, the frame rate may be used as a counter to compute the time consumption of events; object tracking may be implemented online or offline and may refer to short-term or long-term tracking of single or multiple objects [14], and it works well in 2D scenes to detect the successive locations taken by the tracked objects. The motion identified by tracking a given object across successive frames can then be used to produce a discrete signal in the scalar domain, a capability provided by offline video-tracking software such as Kinovea®, which has typically been used in research related to human performance monitoring and the kinematics of human body segments [12,15,16]. The individual scalar signals may then be placed in the time domain and used, either directly or after the application of different filters, as inputs for supervised learning algorithms, such as artificial neural networks (ANNs). In this regard, ANNs [17] and other artificial intelligence (AI) techniques are able to solve multivariate non-linear problems, which are quite common when nonlinear signals characterizing multi-class problems are used as classification inputs [9,18,19]. Similar to object-tracking software applications, the implementation of ANNs, as well as of many other types of classification algorithms, has lately become affordable through the development of free, open-source software.
The goal of this study was to test the performance of a system implemented for data collection, signal processing, and supervised classification, with application in the long-term performance monitoring of small-scale wood processing facilities. One of the basic assumptions and requirements in developing and testing the system was to rely, to a great extent, on freely available tools and software to produce accurate classifications of the operational events in the time domain. The system used an affordable video camera to collect the data and the freely available Kinovea® (version 0.8.27, https://www.kinovea.org/) and Orange Visual Programming Software® (version 3.2.4.1, Ljubljana, Slovenia) tools to produce the needed discrete scalar signals and to check the accuracy of the classification. Therefore, the objectives of this study were (i) to check the classification performance of the operational events in the time domain by using the original signals as a baseline, less resource-intensive alternative; (ii) to check the classification performance enhancements, if any, brought by adding a derived, simple-to-compute signal based on the original signals as an additional, more discriminative solution for the implementation of the ANN algorithm; and (iii) to check the classification performance enhancements, if any, brought by filtering the signals to their roots, by median filters, as a fine-tuned, discriminative solution for the implementation of the ANN algorithm. Acknowledging the importance of, and the differences that may occur in, classification performance in a testing phase, the workflow of the ANN implementations was restricted to the training phase for the achievement of most of this study's objectives. Furthermore, the ANN implementations were carried out at two levels of detail, one characterizing the data documented at the finest possible level (elemental study) and one characterizing the data aggregated into two discriminant classes that are important for production monitoring: machine working versus machine non-working.
2. Materials and Methods
2.1. Video Recording and Media Input
The media files used to test the system were captured by video recording in a sawmilling facility using an inexpensive, small-sized camera mounted near the steel frame of the sawmilling machine, with its field of view oriented perpendicularly towards the active frame (Figure 1). The surveyed machine (Mebor, model HTZ 1200) operates in a similar way to that described in [9], with the main differences being its propelling system, which was electrical, and the maximum allowable size of the input logs, which was larger. During the field study, the machine processed Norway spruce logs with diameters in the range of 26 to 69 cm (average of 45 cm), with the cutting settings and sequences left to the operator's discretion, at an air temperature of ca. 25 °C and without any interference to the camera's vision caused by sawing dust.
At the time of the field study (2017), the placement and use of the camera were intended to collect the data needed to estimate the productivity of operations by a rather traditional approach, which involved manual measurement of the wood inputs and outputs and a chronometry method based on regular video surveillance. This is why the camera was used to continuously record a full day of operation and why no arbitrary, highly reflective markers were placed on the moving parts of the machine. Nevertheless, by its placement, the camera enabled video recording of operations at distances of ca. 1 to 5 m; it produced a set of video files of 20 min in length each (by settings), at a video resolution of 1280 × 720 pixels and a sampling rate of 21 frames per second (fps). During recording, the lighting conditions were good, ensuring good visibility in the collected files, owing to the natural and artificial light available in the hangar of the facility. For the purpose of this study, all the video files were analyzed in detail by playing them in the office phase and checking them against two selection criteria. The first was that the recording contained all the work elements typical of the surveyed machine, including the different delays that characterized its non-working events. The second was that the field of view was not obstructed by moving features such as interfering workers or other machines. Based on this analysis, one media file was selected for further processing, and the parts of it that failed to meet the above criteria were removed.
2.2. Signal Extraction, Processing, and Event Documentation
Extraction of the discrete scalar signals from the media file was done by means of the free Kinovea® software. An example of the settings used is given in Figure 1. To provide a reference for measurement by tracking, a convenient coordinate system was chosen, with the y-axis set close to the middle of the field of view captured in the media file; the origin of the coordinate system was defined close to one seventh of the field of view's height (Figure 1). Based on these preliminary settings, the tools for trajectory configuration were used to set up the tracking point. This was enabled by the presence, on the machine's frame, of some distinguishable geometric features resembling typical markers (Figure 1), one of which was selected as a reference in the first frame taken into analysis. Following the selection, the effective tracking of the machine's movements was done automatically at 50% of the real running speed of the media file, and the data output in this way was then exported as a Microsoft Excel (Microsoft, Redmond, WA, USA) XML file. For each frame, this file contained the coordinates, given in pixels, and the current time of each coordinate pair (x, y). This output formed the reference dataset of this study, and it accounted for a total of 13,116 frames.
Based on the extracted data, and for convenience in graphically reporting some of the results, the original data collected on the two axes was downscaled by a factor of 1/100. The resulting datasets (XREF, YREF) were then used as direct inputs to compute a new, derived signal (ΔXYREF) in the form of a restricted positive difference between XREF and YREF (that is, YREF was subtracted from XREF, and negative values were included as positive values in the analysis); for filtering purposes, a median filter implemented over a 3-observation window was assumed and iterated until the root signals were reached. These measures were taken to improve the separability of the data under the assumption of reduced noise in the signals' patterns.
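The downscaling and the derived signal can be illustrated with a short sketch. The following is a minimal example, assuming the tracking export has been converted to a plain table with one row per frame and columns named t, x, and y; the file and column names are placeholders, not those of the original study:

```python
import pandas as pd

# Hypothetical file and column names; the study's export was an Excel XML
# file containing the time and the pixel coordinates of each tracked frame.
track = pd.read_csv("tracking_export.csv")

# Downscale the pixel coordinates by a factor of 1/100, as described above.
x_ref = track["x"] / 100.0
y_ref = track["y"] / 100.0

# Restricted positive difference: YREF is subtracted from XREF and negative
# results are included as positive values (i.e., the absolute difference).
dxy_ref = (x_ref - y_ref).abs()

signals = pd.DataFrame({"t": track["t"], "XREF": x_ref,
                        "YREF": y_ref, "dXYREF": dxy_ref})
```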
The benefits of using median-filtered data are explained, for instance, in [20]; in short, filters from this class may improve the signal-to-noise ratio and provide a dataset that is unaltered by truncation, for instance, in the time domain. This property is important in preserving the distribution of time consumption across categories, and the median filtering approach was similar to that detailed in [9]. Regarding the use of root signals, the approach was based on the theoretical background given in [21] and was carried out to achieve better uniformity of the signals and to remove from the original data the noise caused by inter-pixel movement of the tracker. It is worth mentioning that a limited calculation effort was assumed from the beginning, based on an initial plot of XREF and YREF in the time domain, and that the approach may be suitable only for applications such as the one described in this study, since the computational effort required to reach the root signals can be extensive by definition [21]. The median filtering procedure implemented to obtain the root signals required five iterations in the case of XREF and six iterations in the case of YREF. Based on the filtering results, a new set of signals was developed (XROOT, YROOT, and ΔXYROOT), taking as a reference the number and order of observations specific to YROOT. This approach led to a minor data loss at the extremities of the datasets (12 observations in total).
Table 1 shows the six signals used in the training phase of the ANN.
Data coding was done by analyzing the media file in the finest possible detail and by considering the kinematics of the machine. This procedural step used the work sequences and codes shown in Table 2. In general, machines of this class operate by adjusting the cutting frame height through upward or downward movements before cutting and before returning the frame to start a new work sequence; in addition, they enable forward movement of the cutting frame to carry out the active cut and backward movement, on the empty return, to reach the starting point of a new operational sequence. As such, to detach a piece of wood from the log, the typical sequence was moving the frame downward, then moving it forward while carrying out the cut, moving it upward, and moving it backward. This sequence was conventionally adapted to the order of log processing. Using the codes attributed by the analysis of the video files, four datasets were developed and used in the ANN training. The first set contained the XREF, YREF, and ΔXYREF data coded in detail as shown in Table 2, and the second set contained the same signals with the data coded as working (W) and non-working (S) events. The same data organization procedure was used for the last two datasets, which contained the XROOT, YROOT, and ΔXYROOT signals documented in full and essential detail, respectively. Therefore, the analysis of the two groups of signals covered two alternatives: fully detailed data (FULL), which included the events MD, MF, MU, MB, and S, and essentially detailed data (ESSEN), which included the events S and W.
2.3. Setup and Training of the ANN
The ANN was set up for training using the freely available Orange Visual Programming Software (version 3.2.4.1) [22]. The main parameters of the ANN were configured similarly to those explained in detail in [9], for the same reasons regarding computational cost and classification performance. The setup used the rectified linear unit (ReLU) activation function, the Adam solver, and an L2 penalty regularization term set at 0.0001; additional settings consisted of three hidden layers of 100 neurons each and 1,000,000 iterations for training a given ANN model. For all models, training and scoring were done by stratified cross-validation with the number of folds set at 20. The training procedure used two signal sets (REF and ROOT) of seven possible combinations each, as shown in Table 3. These were used to account for the designed analysis resolutions and to determine, by training, which combination produced the best results. Using the described approach, a total of 28 ANN models were trained. For example, REF1 referred to training the ANN first with the fully detailed XREF signal and then with the essentially detailed XREF.
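Although the models were built in Orange's visual workflow, the reported settings can be sketched in code. The following is a minimal, non-authoritative example assuming they map onto scikit-learn's MLPClassifier (on which Orange's neural network learner is based), with X and y as placeholders for one signal combination and its event codes:

```python
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.neural_network import MLPClassifier

# Settings reported above: ReLU activation, Adam solver, L2 penalty (alpha)
# of 0.0001, three hidden layers of 100 neurons each, and up to 1,000,000
# training iterations.
ann = MLPClassifier(hidden_layer_sizes=(100, 100, 100),
                    activation="relu",
                    solver="adam",
                    alpha=0.0001,
                    max_iter=1_000_000)

# Stratified cross-validation with 20 folds, as used for training and scoring.
cv = StratifiedKFold(n_splits=20, shuffle=True, random_state=0)

# Out-of-fold predictions for one dataset (X: signal values, y: event codes);
# X and y are placeholders, not variables from the original study.
# y_pred = cross_val_predict(ann, X, y, cv=cv)
```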
Following the training procedure, the most commonly used performance metrics (CA—classification accuracy, PREC—precision, REC—recall, and F1—the harmonic mean of PREC and REC) were calculated for each of the trained ANN models. The definitions, meaning, and interpretation of these classification performance indicators may be found, for instance, in [23,24]; as a supplementary check of the classification performance, the area under the curve (AUC) was computed for each model. While all of the computed metrics are important in characterizing the classification performance, the focus of this study was on the REC metric, following the reasoning given in [25], which applies to time-and-motion studies. The configuration of the computer used to train the ANN models was that given in [9], and to differentiate between the training costs incurred by the potentially different complexities of the signals, the training time was counted, in seconds, for each model.
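A minimal sketch of how these metrics could be computed outside Orange is given below, assuming out-of-fold predictions (y_pred) and class probabilities (y_prob) are available; the weighted averaging used here is an assumption, since the aggregation applied by the software is not stated above:

```python
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)

def performance_report(y_true, y_pred, y_prob, classes):
    return {
        "CA":   accuracy_score(y_true, y_pred),
        "PREC": precision_score(y_true, y_pred, average="weighted",
                                zero_division=0),
        "REC":  recall_score(y_true, y_pred, average="weighted",
                             zero_division=0),
        "F1":   f1_score(y_true, y_pred, average="weighted",
                         zero_division=0),
        # One-vs-rest AUC for the multi-class (FULL) case; for the binary
        # (ESSEN) case the positive-class probabilities can be passed instead.
        "AUC":  roc_auc_score(y_true, y_prob, multi_class="ovr",
                              average="weighted", labels=classes),
    }
```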
2.4. Data Processing and Analysis
Data processing and analysis were done mainly in Microsoft Excel® (Microsoft, Redmond, WA, USA, 2016 version) and relied on visual comparisons of the data. It was assumed that statistical comparisons of the classification performance metrics would not be relevant, because the training outcomes were quite different and because even a small difference in a given pair of metrics could have significant effects if the results were scaled to larger datasets. As a first step of the data analysis, the XREF, YREF, and ΔXYREF signals were plotted in the time domain against the codes attributed to the events identified and delimited at the FULL and ESSEN resolutions. While this helped in understanding the kinematics of the machine, only a selection of the most representative data is given in the results section, due to limited graphical space. Based on the resolutions studied (FULL, ESSEN) and the signal combination classes (REF, ROOT), the main descriptive statistics were computed and reported for the events encoded in each of them, as absolute and relative frequencies. Then, the training time was reported for each signal, by combination class and resolution.
Classification performance was reported at the same levels of detail, by a graphical comparative approach. At this stage, however, only the global classification performance of each model was considered for reporting. Based on these outcomes, the best models were selected for both the FULL and ESSEN study resolutions, and their classification performance metrics were reported in detail at the event level. The last task involved a more detailed analytical approach to identify and characterize the misclassified events. For this step, the data of the two models with the best global classification performance were exported from the ANN training software into Microsoft Excel® (Microsoft, Redmond, WA, USA, 2016 version), where sorting procedures were used to account for the number and share of misclassifications at the event level; this step was complemented by a graphical representation of two examples of misclassifications considered relevant for the analyzed models.
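As a sketch of the per-event misclassification summary described above (carried out in Excel in the study), the counts and shares of misclassified frames per true event code could be obtained as follows, with y_true and y_pred as placeholder names:

```python
import pandas as pd

def misclassification_summary(y_true, y_pred):
    # For each true event code, count how many frames were predicted as a
    # different code and express this as a share of that event's frames.
    df = pd.DataFrame({"true": y_true, "pred": y_pred})
    df["miss"] = (df["true"] != df["pred"]).astype(int)
    summary = df.groupby("true").agg(frames=("miss", "size"),
                                     misclassified=("miss", "sum"))
    summary["share_percent"] = 100.0 * summary["misclassified"] / summary["frames"]
    return summary
```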
4. Discussion
The classification outcomes produced by the system under study were encouraging, a fact that may be discussed from at least two points of view. First of all, classification precision is generally considered very high if, for a given application, values of 90% or more are achieved [23]. However, the utility of the metrics, their magnitude, and the role of a given class in a given application should also be considered. From this point of view, the classification recall (REC) would be the best choice to evaluate the performance of the models. As it reached values of 92.3% and 97.3% for the FULL and ESSEN resolutions, respectively, one could expect misclassifications of time consumption associated with the recall metric of ca. 1.5 (ESSEN) to 4.5 (FULL) minutes per hour of monitoring. However, this assumes that the models are used as given herein, without any other checks on geometry and other features, whereas the data for active cutting (FULL) and machine working (ESSEN) would lead to far fewer misclassifications. On the other hand, it has been shown that improvements in classification ability may be obtained by scaling the data to standardize the inputs [26], so that the input data reach a mean of zero and a standard deviation of one. While the potential improvements brought by this kind of approach still need to be checked for the application described herein, what is clear for now is that they would also require additional computational effort.
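As a rough check of these figures, and under the simple assumption that the share of misclassified frames translates directly into misclassified time, the expected misclassified time per hour of monitoring is (1 − REC) × 60 min, i.e., (1 − 0.923) × 60 ≈ 4.6 min for the FULL resolution and (1 − 0.973) × 60 ≈ 1.6 min for the ESSEN resolution, which is consistent with the approximate values quoted above.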
Since the performance of ANNs is related to the data (signals, patterns) used as inputs, one good approach to improving the classification outcomes would be to obtain very good signals, including by augmenting them [23] where appropriate. As such, converting an analog signal into a good digital one and augmenting the inputs provided by a digital signal may be done in several ways, keeping in mind both the acquired signal and the underlying process [27]. From this point of view, both deriving a new signal and filtering the input signals to their roots could be seen as a form of data augmentation. As the first approach actually led to an increase in the general classification accuracy, one may conclude that it can bring improvements for applications such as the one described herein. However, its usefulness would also depend on the acceptability of the classification outcomes, since the improvement over using the raw signals was 0.6% in the multiclass (FULL) resolution problem and 0.2% in the binary (ESSEN) resolution problem. It is worth mentioning that the raw signals (REF) were produced at half the original media speed; therefore, the trade-off between accuracy and time savings in the office phase should be explored further.
The above leads naturally to the data collector, its capabilities, setup, and settings, which could hold the key to producing better signals. For a data collector such as the one used herein, and considering the classification recall metric, the lower-performing events were found to be the upward, downward, and backward movements. This is not surprising given the camera's capabilities and location, the background data used to produce the raw signals, the distance to the surveyed events, and the speed at which the events occurred. The backward movement was found to occur at considerably higher speeds than the forward movement, a fact that could have affected the output signal at least through the sampling frequency. This was also the case for the upward and downward movements, with the latter occurring in the background of the field of view; therefore, one may assume that the speed at which the events occurred and their distance from the camera affected the classification performance. The general classification outcomes may also be related to the resolution of the camera and to the sampling frequency. For instance, Ref. [28] indicated that movement prediction errors when monitoring construction sites may be affected by the sampling frequency (i.e., the number of frames per second). On the other hand, higher sampling rates have produced excellent results even for faster events [16]. Hence, a camera with a better resolution, a finer sampling rate, and an improved shutter speed might have improved the outcomes, given that the software used to produce the signals works by a frame-by-frame, pixel-by-pixel approach. Since the camera itself is not the only component of the data collection system, one may ask whether well-designed markers placed on the machine's frame could have produced better signals. Indeed, many applications of the Kinovea® software have produced excellent results using markers [12,15], while the software itself has been used in many types of applications involving motion analysis [12,15,16] or the inter-validation of different methods [16,29].
This study addressed only those events related to the use of the machine. Therefore, long-term applications could also include external events such as log feeding and log rotation on the machine's platform (partly included in this study). It remains to be checked to what extent a multi-tracking approach could improve the classification of events and their time consumption for a setup such as the one described herein, as well as for a setup designed to monitor other active parts of the machine, such as the devices used to hold and rotate the logs. Accordingly, for a finer tuning of the system, it remains to be checked to what extent camera calibration could solve other problems, such as obtaining variables related to the size of the logs and of the processed wood products. However, assuming a system such as the one described herein, what matters for long-term data collection sessions is finding and maintaining an unobstructed, interference-free position for the camera. Moreover, the system's financial performance and sustainability should be checked against those of systems based on cheap external sensors [9].
Last but not least, this study used an ANN architecture, which is just one of several AI techniques. Our choice of this technique was based on its performance and popularity [23] and mainly on its ability to solve multivariate problems [19] and to extract meaningful information from complex patterns [25]. Nevertheless, future studies should check the performance of other techniques, such as support vector machines (SVM), Bayes classifiers (BC), or random forests (RF), to see whether they could produce better results. Furthermore, the approach of this study was only to check the performance of ANN architectures in classifying the data. This was done by training only, and it produced excellent results. However, extending the system to obtain long-term data would be beneficial for building more robust models and for keeping a separate subset for testing and validation, as these are the typical steps in ANN development and deployment [17]; it would also extend our understanding of the use of algorithms, electronics, and computer software in the assessment of operational performance in the wood supply chain by adding new approaches to the known ones [30,31,32].