1. Introduction
The establishment of pest populations outside their native ranges is facilitated by climatic change and global trade [
1]. Larvae of non-native invasive species are accidentally transported in wood packaging of globally traded goods through ports, handling facilities or truck roads in places that are not biologically adapted to regulate their multiplication (i.e., low plant resistance and absence of natural enemies) [
2].
In Europe alone, wood-boring species of the coleopteran family
Cerambycidae (longhorn beetles), native to various parts of Asia, are now considered established or establishing in Spain, Italy, Austria, Germany, Finland, France, Montenegro, Switzerland and Turkey. Longhorn beetles attack at least 140 different tree species, including citrus and stone fruits (peaches, nectarines, plums, cherries and apricots), as well as forest woodland (infested trees become unsuitable for pulp and wood exploitation). Stone fruit production and the production of roundwood for industrial uses are estimated to be worth several billion dollars; therefore, the cost of damage to these products is significant, as is the cost of eradication and control efforts [
1]. Cerambycid female borers lay their eggs under the bark or in physical cavities, wounds or cracks of their host trees. Newly hatched borers chew directly into the heartwood. Chewing under the bark in living wood severely damages water- and sap-conducting tissues. Adults emerge from infested trees in spring or summer after penetrating the bark, which causes an additional problem: the exit tunnels become entry points for several plant pathogens. The repeated tunneling by many borers over many generations gradually weakens the tree, causes structural instability (wind breakage) and fruit drop, and leads to the decline and eventual death of susceptible trees [
3].
There is a wide bibliography [
4] on optical [
5,
6], laser vibrometry [
7], piezoelectric sensors, and accelerometers [
8,
9,
10,
11], used to detect locomotion and feeding sound of larvae or adult pests inside the tree trunk. A mildly intrusive and widely applied method for inspection of commodities is based on inserting a piezoelectric probe in the tree trunk to listen for potential internal audio activity due to feeding and locomotion (i.e., passive acoustic detection) [
12,
13,
14,
15,
16,
17,
18,
19]. The feeding activity is audible [
13,
14,
15,
16,
17,
18] in two of the biological stages of the insect: (a) when the mature larvae tunnel into the sapwood or heartwood to form a pupal chamber (February–April), and (b) after their exoskeleton has fully hardened and the adults dig emergence tunnels through the bark to exit the trunk (late spring to summer). Larval activity has a lower audio imprint than adult activity, whereas during the egg stage and during pupation in the pupal chamber the pest is inaudible. In brief, the benefits of a piezoelectric probe are that it is portable and practical, costs less than competing methods (e.g., vibrometry), is much more sensitive than microphones, does not require a mains supply, calls for little training, and is available to practitioners as a commercial product.
Currently, all detection methods are manually applied. A trained technician must examine and decide in situ on the state of the infestation. The current manual approach has several shortcomings:
(a) Field visits and frequent manual inspection of trees and other plants are costly, cumbersome and impractical to be scaled to large numbers of trees.
(b) The listener has limited time to inspect a single tree and the larvae could be present but inactive during the inspection time for several reasons: for example, the pest may happen to be in an inaudible biological stage (e.g., egg or pupa) but will evolve in the short run, or the trunk may have low infestation load and the pest may not be chewing during the inspection’s time-slot.
We have shown in [
19] that a piezoelectric device can record and transmit the vibrations picked up by a probe inserted into a tree. The emphasis of this work is not on hardware implementation but on the nature of this particular vibrational signal and its classification in the presence of other vibrational interferences commonly existing in the field. We introduce fast, automatic screening of vibrational records, which can extend over weeks to months, based on deep-learning models that look at the spectrogram of the internal vibrations before reaching a decision on the infestation state of the tree (see
Figure 1 for a depiction of the main idea). We used a commercial version of a piezoelectric device and collected several thousand transmitted recordings over a period of 6 months, mostly in urban environments. We train several deep learning models, each with its own merits. The database (the first of its kind for this problem) is available for download at
http://www.kaggle.com/potamitis/treevibes (accessed on 13 February 2021) along with the associated deep learning code.
Our vision is ambitious: remote, automatic surveillance of trees against borers at global scales based on deep nets.
In this work, we demonstrate that this vision is technologically feasible; it creates services that do not currently exist and is hampered only by the current cost of materials, which is expected to drop in the future.
The structure of this work is as follows: We first examine the signal of wood-boring insects based on the example of
Xylotrechus chinensis (Coleoptera: Cerambycidae), an Asian woodborer also known as the Tiger longicorn beetle, causing high mortality of
Morus trees (mulberries) in Greece. In the context of this work, audio refers to the vibrations caused by the pests cracking the tree fibers. Different types of wood and different borers may produce sounds with different spectral content elsewhere, but the extensive literature of the field (see [
13,
14,
15,
16,
17,
18,
19,
20,
21,
22,
23,
24,
25] and the references therein) shows that acoustic emission cannot be avoided. We then examine the vibrational soundscape of trees in urban spaces and forests and analyze the practical benefit of automatic remote surveillance of trees against borers. Subsequently, we describe deep learning techniques as applied to the spectrogram of vibrations originating from piezoelectric probes inserted in tree trunks. Finally, we conclude with future prospects, especially how our approach can be connected to the Internet of Things (IoT) to reach global scales.
2. Materials and Methods
In this section we start with basic principles of vibrational recorders and the nature of the signal recorded under different environmental conditions.
2.1. The Device
The core of the sensor we used for listening to vibrations caused by borers is the piezoelectric crystal. This is an electromechanical system (the crystal and an embedded amplifier) that reacts to compression by converting it to a fluctuation of an electrical charge. Therefore, it is closer to the concept of a seismometer than that of a microphone. In the context of our application, compression is inflicted by any vibration inside the wood, while the electrical fluctuation can easily be converted to an audio signal that can be stored, compressed and transmitted. A metal waveguide (see
Figure 2-left) is a metal bar, functioning as a sound coupler between the wood and the sensor probe.
The circuit is normally in sleep mode; it wakes up on a predefined schedule (e.g., 20 s every hour) and takes a recording before going back to sleep. The recording duration and the sampling density are configurable through the reporting server, which means that there is bidirectional wireless communication between the deployed devices and the reporting cloud server. The recordings are stored on the SD memory card, and the timestamp is written into the filename. All audio recordings are compressed with the open-source Opus codec before being sent over the communication channel (see
Figure 2-right for a field application). The bit rate is 24 kbps at a sampling frequency of 8 kHz. The device uses a global SIM card; therefore, any tree can be tracked from anywhere in the world. There is no need to recharge the device, as it has an embedded solar panel that provides enough power for its low-power electronics. Therefore, it can stay on a tree indefinitely, sampling and transmitting the internal vibrations of the tree. The location of the device appears on the world map of the server, as the device carries a Global Positioning System (GPS) receiver. All data are communicated through the mobile network. Further details of a proof of concept of this approach can be found in [
19].
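As a rough back-of-the-envelope check of the figures above, the sketch below estimates the upload volume per device. It assumes that the quoted 24 kbps is a constant bit rate of 24,000 bits per second and that the example schedule of one 20 s recording every hour is used; container overhead and network protocol headers are ignored, and the function name is purely illustrative.

```python
def recording_bytes(bitrate_bps: int, duration_s: float) -> float:
    """Approximate payload size of one compressed recording, in bytes."""
    return bitrate_bps * duration_s / 8


# Assumed values: constant 24 kbps, 20 s snippets, one snippet per hour.
BITRATE_BPS = 24_000
DURATION_S = 20
RECORDINGS_PER_DAY = 24

per_recording = recording_bytes(BITRATE_BPS, DURATION_S)
per_day = per_recording * RECORDINGS_PER_DAY

print(f"{per_recording / 1e3:.0f} kB per recording, {per_day / 1e6:.2f} MB per day")
```

Under these assumptions a device uploads on the order of a megabyte per day, which is why the recording duration and schedule are worth keeping configurable.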
2.2. The Signal
Back in 2017, on the island of Crete, Greece, and in Spain, it was observed [
26,
27] that several mulberries appeared to bear exit tunnels (see
Figure 3-left) that had not been observed before. Suspicious trunks were sliced and the larvae found were subjected to polymerase chain reaction (PCR) analysis that showed that they belonged to the invasive cerambycid
X. chinensis (see
Figure 3-right). We took several trunks as in
Figure 3-left to the laboratory, where we made several recordings using a multichannel recorder (Tascam DR-680MKII) (
Figure 4).
In
Figure 5 (top) one can see a typical example of these recordings. Generally, the internal soundscape of a healthy tree—excluding externally induced vibrations—is silent at the level of audio sounds we seek. If it is infested one expects to hear a train of pulses like in
Figure 5-middle and the rate of insect bursts can be used to estimate the likelihood that the tree is infested [
28]. The train consists of a number of bursts, each one corresponding to a crack of fibers as the borers feed and move (see
Figure 5-bottom for a single burst). We can confidently attribute these impulses to
X. chinensis because the recordings have been taken in the controlled environment of the lab, the adults have emerged some months after the recording and the trunk has been subsequently sliced and examined for other possible insects. Looking at
Figure 5, one may suppose that the detection of borers is an easy task: an envelope follower or a simple thresholding could reveal the impulses. This is not the case as field recordings can be more complex than laboratory recordings.
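To make the naive approach concrete, the following sketch applies an envelope follower and a fixed threshold to a synthetic pulse train. Everything in it is illustrative: the synthetic signal (weak Gaussian noise plus short decaying 1.5 kHz bursts), the 5 ms smoothing window, the threshold of five times the median envelope and the function names are assumptions, not parameters taken from this work.

```python
import numpy as np

FS = 8000  # sampling rate in Hz, as used for the recordings in this work


def synthetic_pulse_train(fs, duration_s=2.0,
                          onsets_s=(0.2, 0.5, 0.8, 1.1, 1.4, 1.7), seed=0):
    """Weak noise plus one decaying sinusoidal burst at each onset time."""
    rng = np.random.default_rng(seed)
    x = 0.01 * rng.standard_normal(int(duration_s * fs))
    t = np.arange(int(0.010 * fs)) / fs                # 10 ms burst
    burst = np.sin(2 * np.pi * 1500 * t) * np.exp(-t / 0.003)
    for onset in onsets_s:
        i = int(onset * fs)
        x[i:i + burst.size] += burst
    return x


def envelope(x, fs, win_s=0.005):
    """Envelope follower: rectify, then smooth with a moving average."""
    n = max(1, int(win_s * fs))
    return np.convolve(np.abs(x), np.ones(n) / n, mode="same")


def count_bursts(x, fs, k=5.0, min_gap_s=0.05):
    """Count upward crossings of a threshold set at k times the median envelope.

    Crossings closer than min_gap_s to the previous one are merged into it.
    """
    env = envelope(x, fs)
    above = env > k * np.median(env)
    onsets = np.flatnonzero(above[1:] & ~above[:-1]) + 1
    if onsets.size == 0:
        return 0
    gaps = np.diff(onsets) / fs
    return 1 + int(np.sum(gaps >= min_gap_s))


signal = synthetic_pulse_train(FS)
print("bursts detected:", count_bursts(signal, FS))
```

On a clean signal of this kind the detector recovers the inserted bursts, but it would respond in the same way to any impulsive interference such as a knock or a raindrop, which is why thresholding alone is not enough in the field.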
A soundscape is a combination of sounds that arises from an environment. It refers to both the natural acoustic environment (animal vocalizations, weather sounds, rain) and sounds created by humans (traffic sounds, horns, footsteps, vocalizations). One may expect that the internal soundscape of a tree in the field is quiet and dull. However, it is not. With the term ‘internal’, we mean everything that a recording element located inside the tree would register. In the context of this work, we are interested only in sounds of borers, but these must be discerned against any other possible form of vibration. Two features of such signals must be taken into account when discriminating between the target insect signals and incidental noise:
(a) As mentioned in the introduction, depending on the stage of its biological cycle, the pest can be noisy or cryptic. Moreover, snippets taken from trees in urban spaces can be rich in vibrations originating from traffic, footsteps, vocalizations of dogs, birds and humans, shaking of the branches and leaves due to the wind, and countless other unpredictable audio sources. Some of these vibrations propagate in the wood and reach the metal probe of the device. Therefore, recordings can be very noisy, sometimes to the point that external noise dominates over the impulsive sound of the borer.
(b) One does not know if there are borers in the tree and, even if there are, one cannot know their number and location inside the tree. Some of these impulses are feeble because they originate from a location distant from the probe. Depending on the kind of wood, the probe can detect feeding sounds within a sphere of 1.5–2 m radius.
In
Figure 6 we gather characteristic examples of biophony, anthropophony and geophony taken from the transmitted field recordings of the TreeVibes database. In each sub-figure, the top panel shows the time-domain signal and the bottom panel its spectrogram (i.e., the Short-Time Fourier Transform, which represents the change of the frequency composition of the signal over time). The sampling rate is 8 kHz and we use a Hamming window of 512 samples with 50% overlap.
Figure 6a is taken from a young pine tree (not a host of
X. chinensis) in a forest with no signs of wounds or degradation. The recording is quiet with some distant bird chirps mainly seen in the spectrogram near 4 kHz. This is a typical recording of a healthy tree with a quiet background (usually at nights).
Figure 6b is taken from a mulberry with severe visual signs of infestation seen also as vertical strips in the spectrogram, indicative of impulsive audio events. The recording was taken in summer; therefore, it is most probably an adult
X. chinensis digging its way out. The tree is located near a busy street. At 4–5 s there are human vocalizations, whereas from 12 to 18 s a passing car vibrates the tree. The impulses of adults digging their exit tunnels in summer are much stronger than the sounds of larvae at the beginning of the year. Yet, both sounds are clearly audible.
Figure 6c is from an infested tree, but the bird vocalizations are very strong.
Figure 6d is from a healthy apricot tree. The recording was taken under heavy wind and rain. All impulses are due to weather conditions and the shaking of branches and leaves, which result in vibrations. Healthy trees in calm weather may register occasional impulses (but not trains of impulses) that are due to tree metabolism related to humidity levels and dilations. Borers create a characteristic repeated pattern in the form of a pulse train rather than isolated events.
In
Figure 7, we compare the spectral profiles of long-duration recordings taken from parts of an infested and a non-infested trunk placed in a silent laboratory room (i.e., there was no background noise). A different borer in a different tree could create acoustic emissions with a different spectral profile; nevertheless, the profile would not be flat like that of the non-infested trunk. The one-sided power spectral density (PSD) estimate of each recording, sampled at 8 kHz, is found using Welch’s overlapped segment averaging estimator. Specifically, the signal is divided into sections of 512 samples, the modified periodogram of each section is computed using a Hamming window of the same length as the section, and consecutive sections overlap by 50% of the section length.
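The estimator described above can be sketched in a few lines of NumPy. This is an illustrative re-implementation under stated assumptions, not the code used to produce Figure 7: it uses NumPy's symmetric Hamming window, applies no detrending, and the function name `welch_psd` is hypothetical. Library routines such as `scipy.signal.welch` differ in these details by default.

```python
import numpy as np


def welch_psd(x, fs, nperseg=512, overlap=0.5):
    """One-sided PSD estimate by Welch's overlapped segment averaging.

    Returns (freqs_hz, psd); psd is in squared signal units per Hz.
    """
    x = np.asarray(x, dtype=float)
    step = int(nperseg * (1 - overlap))        # 256 samples for 50% overlap
    window = np.hamming(nperseg)
    scale = fs * np.sum(window ** 2)           # compensates for window power

    n_sections = 1 + (x.size - nperseg) // step
    psd = np.zeros(nperseg // 2 + 1)
    for k in range(n_sections):
        section = x[k * step:k * step + nperseg]
        spectrum = np.fft.rfft(section * window)
        psd += np.abs(spectrum) ** 2 / scale   # modified periodogram
    psd /= n_sections                          # average over sections

    psd[1:-1] *= 2                             # fold in negative frequencies
    freqs = np.fft.rfftfreq(nperseg, d=1 / fs)
    return freqs, psd


# Stand-in for a quiet recording: 20 s of white noise sampled at 8 kHz.
rng = np.random.default_rng(0)
noise = rng.standard_normal(20 * 8000)
freqs, psd = welch_psd(noise, fs=8000)
print(freqs.size, "frequency bins up to", round(float(freqs[-1])), "Hz")
```

For white noise the resulting profile is approximately flat, and integrating the PSD over frequency recovers the signal variance, which gives a quick sanity check of the scaling.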
3. Results
3.1. The Database
The device is a seismic sensor and records vibrations from a substrate. In the context of this work, the substrate is wood. It is quite straightforward to acquire recordings in acoustically challenging conditions (i.e., with background interference) from trees that are not infested by borers. By simply inserting the probe in trees known to be healthy (not necessarily mulberries), one may easily capture most of the typical sources of background vibrational interference (traffic, vocalizations, wind, etc.) while avoiding any vibrational signals due to X. chinensis. Although the correct control is to use non-infested mulberries to gather vibrations due to background noise, externally induced vibrations propagate mostly through the external part of the device, so the substrate, in this case, does not alter the validity of the recordings. It is more complicated, however, to get recordings from infested mulberries, as the ultimate way to verify infestation is to cut down the tree (which is generally illegal in public spaces except for authorized phytosanitary personnel) and slice it until one finds the larvae.
We gathered the recordings from infested mulberries in two ways: (a) by attaching the device on trees that had serious visual signs of attack and manually verifying the existence of pulse trains from the audio and visual inspection of spectrograms, and (b) from mulberries that had been cut down with permission by authorities (heavily infested trunks or dead trees).
The database is composed of 33 folders with audio recordings taken from 35 different trees and a corresponding annotation CSV file. This corresponds to roughly one folder per tree. The folders contain the recordings transmitted over a period of 6 months. The recordings are in WAV format; they were received in Ogg format and decompressed. The sampling frequency is 8 kHz. The first 27 folders are used for training and validation and the last 6 for testing.
The data set of the target insect is composed of 4165 field and 53,676 laboratory recordings, most of them 20 s long. Training folders: infested (pulse trains from borers), folders 1–6 and 11–23, 731 recordings; clean, folders 7–10, 24–25 and 35, 1754 recordings; 2485 training recordings in total. Test folders: 26–34.
3.2. Deep-learning as Applied to Spectrograms of Vibro-acoustic Signals
Deep learning (DL) architectures have a modular layer composition where the layers close to the input learn to extract low-level features and subsequent layers rely on the previous layer(s) to synthesize patterns of higher abstraction (e.g., starting from edges and textures and ending in objects) [
29,
30,
31]. As it is impractical to listen manually to hundreds of thousands of clips transmitted to a cloud server from a large number of trees, an automatic process is needed to screen these recordings. Deep learning techniques can provide fast classification (as a rule of thumb, about 5 ms per recording on a single GPU), and they can discern between pulse trains originating from borers and events from other external sources of vibration (as human listeners can). We achieve this by transforming the audio recording into an image through the spectrogram (i.e., the Short-Time Fourier Transform, a 2D representation that can be treated like an image) and feeding the images to a DL model. In the case of the spectrogram, the ‘object’ is a spectral blob that corresponds to a vibration source. It is important in our case not only to detect trains of pulses originating from borers but also to learn to discern between impulsive events belonging to different sources that vibrate the tree although they are located outside it. The operational model calculates the probability of infestation of a tree based on a long history of recordings that can span weeks.
3.3. Verification Experiments
We performed 10-fold cross-validation on field data to estimate how different convolutional neural network (CNN) models are expected to perform in general when used to make predictions on data not used during training. The procedure had a single parameter, k = 10, referring to the number of groups into which the data corpus was split. Each group, in turn, was held out as a test data set and the remaining groups made up the training data set. We fitted a model on the training set and evaluated it on the test set. The accuracy over each fold was measured, and the mean score over the 10 folds along with the standard deviation is reported in
Table 1. We applied a type of data augmentation that rolls each recording at a random point, to randomize the point in time at which the impulses appear. In this work, our aim is not to fine-tune the hyper-parameters of the classifiers through grid-search. The images fed to all CNNs are the spectrograms of the recordings, computed with an FFT size of 256 and 50% overlap, resulting in a 129 × 1251 matrix.
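The rolling augmentation and the spectrogram dimensions quoted above can be illustrated with a short NumPy sketch. Several details are assumptions made for illustration only: the Hamming window, the reflect padding that centres the frames (chosen because it reproduces the 129 × 1251 size for a 20 s recording at 8 kHz), the decibel scaling, and the function names.

```python
import numpy as np

FS = 8000           # sampling rate in Hz
N_FFT = 256         # FFT size
HOP = N_FFT // 2    # 50% overlap


def roll_augment(x, rng):
    """Circularly shift the recording by a random number of samples."""
    return np.roll(x, int(rng.integers(0, x.size)))


def log_spectrogram(x, n_fft=N_FFT, hop=HOP):
    """Magnitude spectrogram in dB with shape (n_fft // 2 + 1, n_frames)."""
    padded = np.pad(x, n_fft // 2, mode="reflect")    # centre the frames
    n_frames = 1 + (padded.size - n_fft) // hop
    # Index matrix: column j holds the sample indices of frame j.
    idx = np.arange(n_fft)[:, None] + hop * np.arange(n_frames)[None, :]
    frames = padded[idx] * np.hamming(n_fft)[:, None]
    magnitude = np.abs(np.fft.rfft(frames, axis=0))
    return 20 * np.log10(magnitude + 1e-10)           # small offset avoids log(0)


rng = np.random.default_rng(0)
recording = rng.standard_normal(20 * FS)   # stand-in for one 20 s snippet
augmented = roll_augment(recording, rng)
image = log_spectrogram(augmented)
print(image.shape)
```

Because the shift is circular, the augmented clip contains exactly the same samples as the original; only the position of the impulses in time changes, so the label of the recording is preserved.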
We compared a set of state-of-the-art deep learning models to find the best-performing model that is most generalizable, has the least loss, and is the most suitable to be embedded for the task to be performed. In
Table 1, we place emphasis on models with a small memory footprint (EfficientNetB0, MobileNet) with a view to embedding them in the probes instead of running them at the server level. It can be seen that, among the five models, EfficientNetB0 and MobileNet compare favorably to the larger models, while the best-scoring model, Xception, had the best convergence and training performance.
To further elaborate on the verification accuracy we use precision, recall and
F1 score metrics on a random 20% holdout data for the best performing model (see
Figure 8 and
Table 2). Precision (
P) is defined as the number of true positives (
Tp) over the number of true positives plus the number of false positives (
Fp). Recall (
R) is defined as the number of true positives (
Tp) over the number of true positives plus the number of false negatives (
Fn). These quantities are also related to the (
F1) score, which is defined as the harmonic mean of precision and recall.
High precision relates to a low false positive rate (false alarm), and high recall relates to a low false negative rate (miss). High scores for both show that the classifier is returning accurate results (high precision), as well as returning a majority of all positive results (high recall). We did not try to fine-tune classifiers through grid-search and voting schemes of different models as optimization of classifiers is not the focus of this work.
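The three metrics follow directly from the counts of true positives, false positives and false negatives. The snippet below is a minimal sketch; the counts are hypothetical and are not taken from Table 2, and the function name is illustrative.

```python
def precision_recall_f1(tp: int, fp: int, fn: int):
    """Return (precision, recall, F1) from confusion-matrix counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical counts for illustration only.
p, r, f1 = precision_recall_f1(tp=90, fp=10, fn=30)
print(f"P={p:.2f}  R={r:.2f}  F1={f1:.3f}")
```

With these counts, few clean recordings are flagged (high precision), but a quarter of the infested recordings are missed (lower recall), and the F1 score falls between the two.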
Finally, in
Figure 9 we demonstrate how the automatic assessment of the infestation status of a tree takes place once the CNN is operational: the probed tree provides a folder of snippets spanning a time interval, and this folder is fed directly to the trained CNN, with spectrograms of vibrations as the input and the probability of infestation as the output. The per-snippet probabilities are averaged, i.e., summed and divided by the number of snippets, so that the tree-level score remains between 0 and 1.
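The aggregation step can be sketched as follows. The per-snippet probabilities are hypothetical stand-ins for CNN outputs, and the 0.5 decision threshold and function names are assumptions introduced here for illustration; the text does not specify a threshold.

```python
def tree_infestation_score(snippet_probs):
    """Average the per-snippet infestation probabilities of one tree."""
    if not snippet_probs:
        raise ValueError("at least one snippet is required")
    return sum(snippet_probs) / len(snippet_probs)


def is_flagged(snippet_probs, threshold=0.5):
    """Flag the tree when the averaged score exceeds an assumed threshold."""
    return tree_infestation_score(snippet_probs) > threshold


# Hypothetical CNN outputs for four snippets recorded from the same tree.
probs = [0.90, 0.80, 0.10, 0.95]
score = tree_infestation_score(probs)
print(f"score={score:.4f}  flagged={is_flagged(probs)}")
```

Averaging over many snippets makes the decision robust to individual recordings taken while the pest was silent or while external noise dominated.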
4. Discussion
4.1. The Practical Value of Knowledge
Pathway management and phytosanitary methods are the first line of defense to prevent or reduce the risk that non-native species are inadvertently introduced to new places via association with imported goods. Phytosanitary interception at commodity entry points (e.g., airports, harbors, stations, lorry parks, cargo depots and quarantine facilities) follows [
1,
2,
3,
4]. Wooden pallets, wood products, ornamental trees, plants but also cargos of fruits and other agricultural products are typically examined before importation using visual inspection and various technological means [
10,
15]. Effective interception of potential pests including but not limited to quarantine species already intercepted in the past, is crucial [
4]. Though not impossible, it is increasingly difficult to achieve eradication of establishing or established invasive species after initial arrival. Interception is currently based on visual inspection and manual application of several technologies.
This work introduces the novel service of automatic screening of wood-related imports. In short, devices are attached to the trees in storage facilities and the vibrational soundscape of the trees is sampled for the whole quarantine period. Clearance is provided automatically once the deep learning models have finished screening the vibrational record of the shipment; otherwise, the cargo is returned to the sender. As the method does not require human attendance (one can attach the device and leave), it can be applied on a larger scale than is currently done. In addition, since it integrates a longer time span of observations than the human service currently applied, it is anticipated to be more accurate. However, the application of pallet monitoring in practice needs further study, and the data of this work do not directly address this application.
Another service that currently does not exist is based on transmitting the systematic registration of vibrations to cloud services. The audio data serve as a permanent record of evidence and the process of cross-examination by trained bioacousticians is decentralized in the sense that the trees under investigation, the stored audio records and the human specialists need not be in the same place—pretty much as the way telemedicine is applied.
Due to the limitations of manual inspection, only 2% of incoming shipments per year are inspected in the US [
4]. Therefore, more often than not, invading species are not intercepted at commodity entry-points and—as an example family—
Cerambycidae beetles are establishing in new locations. Post-border surveillance and containment are easier if the first establishment of the invasive species is detected and localized as early as possible. Forests and parks near commodity entry points are most at risk. If the invasive species attack trees of urban ornamental greenery in public spaces, as in the case of
X. chinensis for mulberries and
Rhynchophorus ferrugineus (Curculionidae) for palms in Crete, the trees are left untreated until they die without consideration of their aesthetic value [
2]. Even in such a case, the automatic screening of vibrational records from trees offers new services and introduces a possible revision of the currently applied protocol. Regarding urban spaces, workers in ornamental greenery assess visually whether the trees already have exit tunnels, discoloration/damage of leaves, signs of rotten tissue and any other visual symptoms of health decline and cut down only the ones that are heavily infested or dead. However, this is too late: visual symptoms appear 1–2 years after the first infestation as regards
Cerambycidae/
Curculionidae, which means that by the time their traces are visible, the borers have completed several generations inside the tree and have escaped to infest new ones. What we suggest is to remove the trees with positive acoustic records rather than base inspection and assessment on visual records. Even if no other treatment is applied, this procedure is expected to delay the degradation of urban greenery relying on the specific tree species.
Let us give a concrete example of the dilemmas phytosanitary personnel face on a daily basis and how automatic screening of vibrational records can help resolve them. Should we cut down a mulberry without any visible signs of degradation, knowing that the city is infested with X. chinensis and that the Morus tree is the primary host? The decision to cut down trees is of grave importance both in terms of financial cost (i.e., removal and secure destruction costs) and in terms of ecological impact. During the experiments of this work, pest specialists refused to cut down trees that had no symptoms of degradation. Yet, there have been cases where an examination of the upper part of the trunk has shown long and vivid pulse trains of vibrations. Again, recordings can serve as evidence, and the pulse rate can be used to assess the infestation status (heavy or low). Removing the tree will locally degrade the greenery, but the alternative is to remove it dead 2–3 years later, by which time escaping adults and their descendants will have infested a large number of healthy trees, thus accelerating the degradation on a regional level. In contrast, removing it a year and a half before visual symptoms appear will significantly prolong the life of ornamental greenery even if no other treatments are applied.
A different protocol may apply to trees of economic importance, such as orchards of stone fruits, where heavy infestations lead to fruit drop. In such cases, the usual procedure is the removal and immediate destruction of all infested trees, as well as those present within a variable radius of the infestation. The decision to characterize a tree as infested, however, is again based on visual signs. As mentioned above, this approach is of limited effect because, by the time visual signs are prominent enough to characterize a tree as infested, many generations of adult pests have already escaped. Therefore, removal of trees based on visual assessment of symptoms is not sufficient to stop the invasion of new areas or to limit the damage where pests are already established. When borers are established, pest control may involve aerial and ground bait pesticide sprays, but their efficiency depends on knowing the time and location of insect infestations as early as possible. The advantage of probing the trees is that it can reveal the problem as early as the first-generation larvae and automatically tag its location (the transmitting device carries a GPS).
4.2. Scaling up the Deployment
The presented approach can be applied to a forest setting to track the spread of forest infestation as well as in an urban space around a port to detect releases before they reach the level of infestation. The only obstacle is the cost of using a large number of devices. Each tree requires one device but as in the case of insect traps that sample the insect fauna, one device per cluster of trees can also be the practical choice. The exact number of devices can be derived by experimentation on specific scenarios. The cost of the device can drop when 5G technology is established and the currently expensive 4G modem is no longer needed. The system is durable in the face of wind and rain as its case is waterproof and can be deployed for a long time (months) as it is self-sufficient in terms of power due to its low-power electronic design and its small, embedded solar panel. Currently, the device uploads short recordings on the server where the deep learning methods classify the data. This approach has the advantage of logging permanent recordings as evidence and allowing us to use elaborate machine learning models that operate on the server where there are no power or memory constraints. The drawback is the cost of one modem per device. Alternatively, the graph of the deep learning model can be embedded in the device so that the decision on the infestation status is taken on the device and not on the server. This approach would have the advantage of not having to upload the recordings to the server. Moreover, binary decisions do not require a large bandwidth to be transmitted—as in the case of recordings—therefore, a long range, low-power protocol (LoRa) can be used so that many devices can form a network of their own that needs only one gateway to report collectively all measurements to the server. Long-term deployment with a frequent reporting schedule can be hampered by long periods without sun. 
In such cases, the sampling and reporting schedule needs to be adjusted due to power limitations. It is not advised to rely on measurements taken during heavy rain or a storm because of the strongly vibrating branches and trunks. In the Results section, we investigate classification models with a small memory footprint with a view to embedding them in the device. This direction will be investigated in future work.