1. Introduction
Measuring fish morphological features and observing fish behavior are important tasks that have to be carried out daily in fish farms. The purpose of these tasks is to assess the growing conditions and welfare of the fish, their health and feeding needs, and to decide the optimal time for harvest. Morphological feature measurement includes the estimation of body dimensions and mass. Indicators of fish health include eye diameter and color, gill color, and malformations in the body shape. Prompt diagnosis of fish diseases is necessary for lower-cost treatment that prevents a disease from spreading to the whole population of the aquaculture. The trajectory, speed, sudden changes in orientation, etc., can be indicators of fish behavior. For example, sluggish or indolent movements may indicate that a fish is sick or simply not hungry, while hyperactive fish that are afraid to reach food may be stressed or feel threatened. Until recently, fish dimensions, weight, etc., were measured manually and invasively, by taking sample fish out of the water. This procedure is time-consuming, costly, inaccurate and harmful to the fish. Fish tracking is almost impossible with the naked eye, without infrastructure to capture and analyze underwater videos of adequate duration. Tasks like fish tracking and classification, morphological feature estimation and behavior monitoring are also important in observing the population of various species in rivers or the open sea. For example, the population of a fish species can be estimated from the density of the detected fish in an image frame. A fish species variety can be identified from its shape and skin texture or color. Fish tracking can provide information about the behavior and potential stress of the fish in their natural environment.
Estimating fish dimensions may also be useful in an open water environment for the assessment of a fish population condition. The morphological feature measurement is based on fish shape or contour alignment in order to measure the dimensions, detect malformations and locate parts of the fish that are of particular interest such as the eyes and the gills. Moreover, morphological feature measurement methods can also be exploited in fish product processing laboratories to estimate fish sizes, freshness, recognize species variations, etc.
A review of smart aquaculture systems is presented in [1] where several processes are described for breeding, nursery to grow out stages of cultured species, preparation of cultured water resource, management of water quality, feed preparation, counting, washing the cultured systems, etc. Another review of computer vision applications for aquaculture is presented in [2]. These applications include fish and egg counting, size measurement, mass estimation, gender detection, quality inspection, species and stock identification, monitoring of welfare and behavior. The trends in the application of imaging technologies for the inspection of fish products are examined in [3]. The reviewed image-processing approaches are classified by the position of the light source (reflectance, transillumination, transflectance, etc.). The applications examined include rigor mortis, tissue and skin color as well as morphology to determine fat, firmness, shape, wounds, blood detection, etc.
The approaches for morphological feature measurement that have been proposed in the literature are often based on the contour of a fish or on the localization of specific landmarks. Both image processing and deep learning approaches have been proposed that can achieve either a high frame-processing speed or a high accuracy. The applications that achieve high accuracy in the estimation of morphological features are often tested on datasets with high-quality images where the fish are clearly visible. In these datasets, fish are captured either in a controlled laboratory environment or underwater with expensive cameras and good environmental conditions (minimal reflections, calm sea, no murky waters, etc.). Moreover, the fish in these datasets are large enough for the details of their bodies to be discriminated. The motivation for this work was to adapt a popular human face shape alignment Machine Learning (ML) technique for fish shape alignment with hardware acceleration support in order to achieve both high speed and accuracy. Additional software modules and applications have been developed that either support the proposed morphological feature extraction method (e.g., orientation classification, landmark editor) or offer additional services such as fish tracking. Moreover, we also developed a dataset with low-contrast underwater images and videos, displaying relatively small fish in murky waters, with intense reflections and refraction, limited visibility, turbulent waters, etc. Thus, testing our framework with this dataset provided experimental results under worst-case conditions.
In the system described in this paper, a three-stage approach is followed to detect fish in low-quality image frames where fish cannot be easily discriminated from the background. In the first stage, the input image frame is segmented to extract the bounding boxes of the detected fish as separate image patches. In the second stage, each patch is classified to a draft orientation in order to select the corresponding pre-trained ML model that can align a number of landmarks on the fish body. In the third stage, the shape (and potential malformations) of the fish can be recognized from the located landmarks, in order to measure fish dimensions, to classify fish and to map fish body parts of special interest (eyes, gills, etc.).
The first stage of the proposed system is based on the open source, fish detection, deep learning method presented in [4]. Although detailed instructions are given in [4] on how to train a customized fish detection model, the available pre-trained one performed very well even with the low-resolution images of our dataset. Therefore, the pre-trained model was stored on the target platform and was called from an appropriate Python script that has been developed for image segmentation. The output of this script is a number of image patches and each one of these patches contains a single fish. The coordinates of these patches in the original input frame are also extracted. Each one of the image patches is classified in a draft fish orientation category, following high-speed methods that are based on OpenCV [5] services. The coordinates of the patches can be used to track the movement of the fish in successive frames. The short history of fish positions that have been detected in the first stage can be used to estimate extended bounding boxes for the intermediate positions through interpolation in order to bypass the time-consuming fish detection process in some frames.
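The bounding-box interpolation mentioned above can be illustrated with a minimal sketch. It assumes plain linear interpolation between two detections and a fixed relative margin to "extend" each box; the function name `interpolate_boxes`, the `(x, y, width, height)` box layout and the `margin` parameter are hypothetical and are not taken from the developed framework.

```python
from typing import List, Tuple

# Hypothetical box layout: (x, y, width, height) in pixels of the input frame.
Box = Tuple[float, float, float, float]


def interpolate_boxes(box_a: Box, box_b: Box, n_intermediate: int,
                      margin: float = 0.1) -> List[Box]:
    """Estimate boxes for the frames between two actual detections.

    Each box is linearly interpolated between box_a and box_b and then
    enlarged by `margin` (a fraction of its size) on every side, so that
    the fish is likely to stay inside it even if its motion is not linear.
    """
    boxes = []
    for step in range(1, n_intermediate + 1):
        t = step / (n_intermediate + 1)          # position between the detections
        x, y, w, h = (a + t * (b - a) for a, b in zip(box_a, box_b))
        dx, dy = margin * w, margin * h          # extension on each side
        boxes.append((x - dx, y - dy, w + 2 * dx, h + 2 * dy))
    return boxes


if __name__ == "__main__":
    # Fish detected in frame 0 and frame 2; estimate the box for frame 1.
    print(interpolate_boxes((0, 0, 10, 10), (20, 0, 10, 10), 1))
```

In such a scheme, the detector would run only on every few frames, while the cheaper orientation classification and shape alignment stages would operate on the interpolated patches in between.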
The extracted patches from each image frame are used as inputs to the last stage of the developed system that performs shape alignment. Aligning a number of landmarks is based on the ML approach called Ensemble of Regression Trees (ERT) presented by Kazemi and Sullivan in [6] and is exploited in popular image-processing libraries such as DLIB [7] and Deformable Shape Tracking (DEST) [8]. The DEST library was exploited in our previous work [9] for driver drowsiness applications. The source code of the DEST library was ported to Ubuntu and Xilinx Vitis environments to support hardware acceleration of the shape alignment process on embedded targets. In the context of the present work, the DEST library has also been ported to Microsoft® Visual Studio 2019 environment for fish shape alignment.
Previous work on fish morphological feature measurement has also been presented by one of the authors in [10] but it concerned different fish species and employed different approaches for image segmentation, pattern matching and contour recognition. The fish contour detection method followed in [10] was not based on shape alignment and ERT. Specifically, three methods were proposed in [10] for fish contour recognition: (a) Pattern Matching (PM), (b) Mask R-CNN applied on a Binary Mask Annotation (BMA), and (c) Mask R-CNN applied on a Segmented Color Image Annotation (SCIA). An absolute fish dimension estimation based on stereo vision was presented in [10] that is also applicable in the framework presented here. The approaches (BMA, SCIA) presented in [10] exhibited, in most cases, a much higher error in the estimation of fish dimensions than the present solution. All the alternative methods (PM, BMA, SCIA) presented in [10] required a much higher frame-processing latency (in the order of seconds) than the current approach; thus, they could not be exploited for real-time applications.
In the present approach, the accelerated shape alignment method required for the morphological feature extraction showed a latency of a few ms on an FPGA platform or less than 0.5 μs on an Intel i5, 3 GHz processor. The error in landmark position estimation is in the order of 5% owed mainly to the low-contrast and quality of the images in our dataset. Advanced histogram equalization techniques for contrast enhancement can be found in [11]. Since no clear reference photographs are available in our case, contrast enhancement could be achieved with No Reference methods like the one presented in [12]. In [12], information maximization is attempted by removing predicted regions (sky, sea) and estimating the entropy of particular unpredicted areas via visual saliency. From a global perspective, the image histogram is compared with the uniformly distributed histogram of maximum information to find the quality score of the applied contrast mechanism. Contrast enhancement methods will be employed in our future work to reduce the landmark position estimation error. In the present work however, we use the developed dataset with low-quality images without any enhancement to test the efficiency of our shape alignment method under worst case conditions.
The contribution of the present work can be summarized as follows: (a) shape alignment based on ERT models is adapted to fish shapes, for high-precision morphological feature estimation, (b) ERT models with different parameters are trained to find a tradeoff between speed and accuracy, (c) a different ERT model can be trained for each fish orientation, (d) a fish detection method efficient for low-contrast images is employed and adapted for local execution in the proposed framework, (e) fish tracking is supported exploiting interpolation and orientation classification results, (f) hardware and software acceleration techniques are implemented for shape alignment and others are also applicable for fish detection in order to support real-time video processing, (g) a new landmark editor has been developed to easily prepare the training set and ground truth data, and (h) a new public dataset with realistic photographs and videos has been developed.
This paper is organized as follows. The related work is presented in Section 2. The materials and methods used are described in Section 3. More specifically, the dataset, tools and target environment are presented in Section 3.1. The general architecture of the proposed system is described in Section 3.2. The employed fish detection and the fish orientation methods are described in Section 3.3 and Section 3.4, respectively. The methodology for implementing fish tracking is described in Section 3.5. The ERT background and the customized DEST package used for fish shape alignment and morphological feature extraction are described in Section 3.6 and Section 3.7, respectively. The experimental results are presented in Section 4. A discussion on the experimental results follows in Section 5 and the conclusions are presented in Section 6. All abbreviations used throughout this paper are defined in Abbreviations.
2. Related Work
Several approaches have been proposed concerning the estimation of fish freshness in a controlled laboratory environment, based either on sensors or image processing. In [13], various sensors that have been used in the literature for freshness estimation are reviewed. These sensors include biosensors, electronic nose or tongue, colorimetric sensor arrays, dielectric sensors and various sensors for spectroscopy (nuclear magnetic resonance, Raman, optical, near infrared, fluorescence, etc.). Quality management systems have also been proposed for freshness, safety, traceability of products, adopted processes, diseases and authenticity [14]. In [15], the freshness of Scomber japonicus (mackerel) stored at a low temperature is assessed from the correlations between the light reflection intensity of mackerel eyes and the volatile basic nitrogen content. The assessment of fish freshness from the color of the eyes is also examined in [16]. In this approach, a handheld Raspberry PI device is used to classify the freshness of a fish into three categories (extremely fresh, fresh, spoiled) based on pixel counting.
Fish classification and counting from underwater images and videos is another major category where several approaches have been proposed in the literature. In [17], fish appearing in underwater images are classified in 12 classes based on Fast Regional-Convolutional Neural Networks (Fast R-CNNs). Similarly, in [18], You-Only-Look-Once (YOLO) [19] and Gaussian Mixture Models (GMM) [20] are compared for the classification of 15 species with an accuracy between 40% and 100% (>80% in most cases). Lekunberri et al. [21] count and classify various tuna fish species transferred on a conveyor belt with 70% accuracy. Their approach is based on various types of neural networks (Mask R-CNN [22], ResNet50V2 [23]) while the size of tuna fish, ranging from 23 cm to 62 cm, is also measured. Underwater fish recognition is performed in [24] with an accuracy of 98.64%. Similarly, fish recognition from low-resolution images is performed in [25] with 78% precision.
Morphological feature estimation is often based on active and passive 3D reconstruction techniques. The active techniques are more accurate but require expensive equipment such as Lidars, while passive techniques employ lower cost cameras. Challenges of passive 3D reconstruction include the accurate depth estimation from two images that have been retrieved concurrently, occlusions, patterns and saturated areas that may cause confusion. In [26], a system based on a stereo camera is described for accurate fish length estimation and fish tracking. A monocular 3D fish reconstruction is presented in [27], where successive images are used from fish carried on a conveyor belt in order to measure their size. CNNs implemented on Graphical Processing Units (GPUs) are used for foreground segmentation and stereo matching. A median accuracy of less than 5 mm can be achieved using an equivalent baseline of 62 mm.
In [28], Facebook’s Detectron2 machine learning (ML) library has been employed for object detection and image preprocessing to generate 22 metadata properties including morphological features of the examined specimens with error rates as low as 1.1%. Otsu threshold is used for segmentation of relatively simple images and pattern matching to locate the eye. If the fish is detected without an eye the images are up-scaled.
Fish tracking (and classification) can be performed with both optical and sonar imaging as described in [29]. Using sonar imaging is the only way to monitor fish at night. In this approach, the Norfair [30] tracking algorithm in combination with YOLOv4 [31] is used to track and count fish. The employed sonar equipment is dual-frequency identification sonar (DIDSON) that exploits higher frequencies and more sub-beams than common hydroacoustic tools. The use of DIDSON has also been described in [32] for the detection of fish morphology and swimming behavior. In this approach, fish must be large enough and within an adequate distance; thus, it is not appropriate for counting small fish. Fish length should preferably be around 68 cm, otherwise an estimation error ranging from 2–8% was measured for different size fish (40–90 cm). In [33], optical and sonar images are also employed for fish monitoring.
4. Experimental Results
In total, four ERT models have been trained using the training set of 270 images described in Section 3.1. These models differ in the number of cascade stages ($T_c$) and the number of regression trees ($N_t$) in each cascade stage, as described in Table 4. The specific $T_c$, $N_t$ parameter values were selected to quantify the accuracy degradation if the number of cascade stages or regression trees is reduced compared to the default model M1 that is expected to achieve the maximum precision.
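The test set of $P = 100$ fish photographs has been derived from the 52 test photographs of the initial 322 image dataset, using LAE augmentation services. The error in the position of the landmarks is estimated from the comparison with the annotation defined as ground truth in the LAE editor. The relative error $\varepsilon_{ri}$ between the estimated landmark position $\hat{k}_i$ and its corresponding position $k_i$ in the ground truth annotation is the Euclidean distance between these two positions, expressed in pixels:

$$\varepsilon_{ri} = \lVert \hat{k}_i - k_i \rVert_2$$

If $\hat{k}_i = (\hat{x}_i, \hat{y}_i)$ and $k_i = (x_i, y_i)$ and the image (width, height) is $(w, h)$, the normalized relative error for landmark $i$, ($\varepsilon_{ni}$) is:

$$\varepsilon_{ni} = \sqrt{\left(\frac{\hat{x}_i - x_i}{w}\right)^2 + \left(\frac{\hat{y}_i - y_i}{h}\right)^2}$$

The standard deviation (SD), $\sigma_\varepsilon$ in the distribution of the landmark estimation error $\varepsilon_{ni}$ across all $L$ landmarks is:

$$\sigma_\varepsilon = \sqrt{\frac{1}{L}\sum_{i=1}^{L}\left(\varepsilon_{ni} - \mu_\varepsilon\right)^2}$$

where $\mu_\varepsilon$ is the mean error of $\varepsilon_{ni}$, i.e.,

$$\mu_\varepsilon = \frac{1}{L}\sum_{i=1}^{L}\varepsilon_{ni}$$

The standard deviation in the distribution of the estimation error $\varepsilon_{nij}$ of a specific landmark $i$ in $P$ images ($0 \le j < P$) is:

$$\sigma_i = \sqrt{\frac{1}{P}\sum_{j=0}^{P-1}\left(\varepsilon_{nij} - \mu_i\right)^2}$$

where $\mu_i$ is the mean error of $\varepsilon_{nij}$, i.e.,

$$\mu_i = \frac{1}{P}\sum_{j=0}^{P-1}\varepsilon_{nij}$$

Another standard deviation metric ($\sigma_P$) used is in the average relative error $\mu_{\varepsilon j}$ (see Equation (14)) of all landmarks of an image in all the $P$ test images:

$$\sigma_P = \sqrt{\frac{1}{P}\sum_{j=0}^{P-1}\left(\mu_{\varepsilon j} - \mu_P\right)^2}$$

where $\mu_P$ is the mean of $\mu_{\varepsilon j}$:

$$\mu_P = \frac{1}{P}\sum_{j=0}^{P-1}\mu_{\varepsilon j}$$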
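The error metrics defined in this section can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the normalized error is taken to be the per-axis difference divided by the image width and height, the standard deviations are population SDs (division by the number of samples), and all function names are hypothetical rather than part of the LAE or DEST tools.

```python
import math
from statistics import mean, pstdev


def pixel_error(estimated, truth):
    """Euclidean distance in pixels between estimated and ground-truth landmark."""
    return math.hypot(estimated[0] - truth[0], estimated[1] - truth[1])


def normalized_error(estimated, truth, width, height):
    """Assumed normalization: each axis difference is divided by the image size."""
    return math.hypot((estimated[0] - truth[0]) / width,
                      (estimated[1] - truth[1]) / height)


def summarize(errors):
    """errors[j][i] is the normalized error of landmark i in test image j."""
    image_means = [mean(row) for row in errors]        # mean error of each image
    columns = list(zip(*errors))                       # regroup errors by landmark
    return {
        "sigma_eps": [pstdev(row) for row in errors],  # SD across landmarks, per image
        "mu_i": [mean(col) for col in columns],        # mean error of each landmark
        "sigma_i": [pstdev(col) for col in columns],   # SD across images, per landmark
        "mu_P": mean(image_means),                     # mean of the per-image means
        "sigma_P": pstdev(image_means),                # SD of the per-image means
    }


if __name__ == "__main__":
    # Two test images with two landmarks each (made-up illustrative values).
    print(summarize([[0.02, 0.04], [0.04, 0.06]]))
```

Keeping the per-landmark and per-image statistics separate mirrors the way the results are reported: per-landmark values highlight which landmarks are hardest to align, while the per-image values describe how the error varies across the test set.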
Table 5 shows the average, minimum and maximum, absolute and relative errors that have appeared in all landmarks and all the $P$ test images when the model M1 is employed for maximum accuracy.
The standard deviation $\sigma_\varepsilon$ limits as well as the $\sigma_P$ deviation of the average error in the $P$ test images are listed in Table 6.
The mean error $\mu_i$ of each landmark $i$, along with its standard deviation $\sigma_i$ is plotted in Figure 12, for model M1. This plot is of particular interest because it highlights the landmarks that show the highest error.
The error in the relative height and length estimation for the default ERT model M1 of the fish in the test set is listed in Table 7 along with their standard deviations.
Figure 13 can be used to compare the error shown by the ERT models of Table 4, in the estimation of the fish length, height and the location of the eyes and gills from the corresponding landmarks.
Concerning the fish orientation methods described in Section 3.4, Table 8 lists the success rates achieved with each method. The PCA method is capable of recognizing the fish tilt with a very good accuracy (an error of less than ±10° in more than 95% of the cases). However, the direction that the fish is facing is recognized with much lower accuracy as shown in Table 8. COD performs a draft classification in left or right direction, while FOD performs a more detailed orientation classification in quadrants Q0–Q3 with the success rate listed in Table 8. Eye template matching (TM) is also used to classify the direction in one of the Q0–Q3 quadrants. A number of combinations of these orientation classification methods are then tested.
In PCA + COD, the tilt found by PCA is used while COD is used to detect the direction. In PCA + COD (2), the COD direction is taken into consideration only if the confidence is above a threshold. In PCA + TM, the coarse left–right direction indicated by TM is taken into consideration to decide the direction on the tilt estimated by PCA. If, for example, the tilt is from bottom-left to top-right, then Q1 is selected if TM finds the fish eye in the right quadrants (Q1, Q3). If the fish eye is found in the left quadrants (Q0, Q2), then Q2 is selected. In the PCA + TM (2) method, the TM direction is considered only if the template matching found the fish eye in one of the quadrants that are compatible with the fish tilt indicated by PCA. For example, if the fish is facing up-right, its caudal fin is in Q2 and its head is in Q1. If the TM method finds the fish eye in Q1 or Q2, then the direction indicated by TM will be assumed correct. Specifically, if TM finds the fish eye in Q1, it will correctly be recognized that the fish is facing up-right, while if the fish eye is found in Q2, it will be assumed by mistake that the fish is facing down-left. If TM finds the fish eye in Q0 or Q3, the direction indicated by TM will not be taken into consideration and only the direction indicated by PCA will be used.
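The decision rules of the PCA + TM and PCA + TM (2) combinations can be sketched as follows. The sketch assumes the quadrant layout implied by the examples above (Q0 top-left, Q1 top-right, Q2 bottom-left, Q3 bottom-right) and assumes that a tilt from top-left to bottom-right is handled symmetrically with quadrants Q0 and Q3, which is not stated explicitly in the text. The names `pca_tm`, `pca_tm2`, `"rising"` and `"falling"` are hypothetical.

```python
# Quadrants are encoded as integers 0..3 for Q0..Q3.
RIGHT_QUADRANTS = {1, 3}

# For each tilt: (quadrant at the left end of the axis, quadrant at the right end).
# "rising"  = tilt from bottom-left to top-right (ends in Q2 and Q1)
# "falling" = tilt from top-left to bottom-right (ends in Q0 and Q3, assumed)
AXIS_ENDS = {"rising": (2, 1), "falling": (0, 3)}


def pca_tm(tilt: str, eye_quadrant: int) -> int:
    """PCA + TM: only the coarse left/right side of the eye is used."""
    left_end, right_end = AXIS_ENDS[tilt]
    return right_end if eye_quadrant in RIGHT_QUADRANTS else left_end


def pca_tm2(tilt: str, eye_quadrant: int, pca_direction: int) -> int:
    """PCA + TM (2): trust TM only if the eye lies on the PCA axis."""
    if eye_quadrant in AXIS_ENDS[tilt]:
        return eye_quadrant          # TM is compatible with the tilt
    return pca_direction             # otherwise fall back on the PCA direction


if __name__ == "__main__":
    print(pca_tm("rising", 3))       # eye on the right side -> Q1
    print(pca_tm2("rising", 0, 1))   # eye off-axis -> keep the PCA direction
```

The second variant discards template-matching results that contradict the tilt, which is why it can only be wrong when the eye is falsely matched at the tail end of the fish.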
In Table 9, a comparison can be found between our fish length and height estimation method and the references that present fish size estimation results. The last column of Table 9 lists the frame-processing latencies ($D_f$) of the referenced approaches and our work.
5. Discussion
The error in the landmark position estimation as presented in Table 5, Table 7 and Table 9 and Figure 11 and Figure 12 is largely due to the low contrast of the images in the employed dataset [34]. Other referenced approaches [23, 24, 25] are also tested with low-quality underwater images. However, in most cases they display fish that are more clearly visible than the images in the UVIMEF dataset. The fish in images from ImageCLEF/LifeCLEF dataset used in [23], Fish4Knowledge [24] and ImageNet [25] are more distinct as shown in the example photographs of Figure 14 (they can be compared to sample images from UVIMEF in Figure 7b–e).
To measure the contrast in the images of a dataset, various metrics can be used. The Root Mean Square (RMS) contrast in an image $Im$, with $Row \times Col$ pixels is defined as:

$$C_{RMS} = \sqrt{\frac{1}{Row \cdot Col}\sum_{r=0}^{Row-1}\sum_{c=0}^{Col-1}\left(Im(r,c) - \overline{Im}\right)^2}$$

where $\overline{Im}$ is the average intensity of the pixels in image $Im$. Another popular metric is the Michelson contrast that is based on the minimum ($I_{min}$) and maximum ($I_{max}$) pixel intensities in an image $Im$:

$$C_{M} = \frac{I_{max} - I_{min}}{I_{max} + I_{min}}$$

The entropy of an image or of a specific region in an image measures the information incorporated in this region. Thus, entropy is also related to contrast since higher entropy indicates more information expressed as abrupt changes in the intensity of neighboring pixels. Entropy is defined as:

$$E = -\sum_{e} P_e \log_2 P_e$$

$P_e$ contains the normalized histogram counts of the image $Im$. A total of 20 indicative photographs have been selected from our dataset (UVIMEF) and the same number of images from ImageCLEF and Fish4Knowledge datasets. The average values of the contrast metrics defined above are listed in Table 10. As can be seen from this table, our dataset has the lowest contrast. UVIMEF has a much smaller RMS and Michelson contrast than the other two datasets. Concerning the entropy, only Fish4Knowledge has lower average entropy than UVIMEF. Moreover, the fish dimensions in UVIMEF photographs are quite small, resulting in patches that may have extremely low resolution (e.g., 70 × 30 pixels). The low resolution and contrast of the images that serve as input to our shape alignment approach pose much worse conditions for our experiments, compared with the referenced approaches.
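The three contrast metrics discussed above can be computed with a short, dependency-free sketch. It assumes a grayscale image given as a list of rows of integer intensities (e.g., 0 to 255) and uses the standard textbook definitions of RMS contrast, Michelson contrast and histogram entropy; whether the intensities were scaled before the values of Table 10 were computed is not known, and the function names are hypothetical.

```python
import math
from collections import Counter


def rms_contrast(image):
    """Standard deviation of the pixel intensities around their average."""
    pixels = [p for row in image for p in row]
    avg = sum(pixels) / len(pixels)
    return math.sqrt(sum((p - avg) ** 2 for p in pixels) / len(pixels))


def michelson_contrast(image):
    """(Imax - Imin) / (Imax + Imin); defined here as 0 for an all-black image."""
    pixels = [p for row in image for p in row]
    lo, hi = min(pixels), max(pixels)
    return (hi - lo) / (hi + lo) if (hi + lo) else 0.0


def entropy(image):
    """Shannon entropy (in bits) of the normalized intensity histogram."""
    pixels = [p for row in image for p in row]
    n = len(pixels)
    # Only non-empty histogram bins contribute to the sum.
    return -sum((c / n) * math.log2(c / n) for c in Counter(pixels).values())


if __name__ == "__main__":
    high = [[0, 255], [0, 255]]        # strongly contrasting patch
    low = [[120, 121], [120, 121]]     # murky, nearly uniform patch
    for name, img in (("high", high), ("low", low)):
        print(name, rms_contrast(img), michelson_contrast(img), entropy(img))
```

Note that the two toy patches have the same entropy although their RMS and Michelson contrasts differ greatly, which illustrates why entropy can rank datasets differently from the other two metrics.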
From the experimental results presented in the previous section, the average relative error (using M1 model) in the alignment of a single landmark is 4.8% (corresponding to an absolute error of 17.54 pixels) with the following SDs: $\sigma_\varepsilon = 0.03$, $\sigma_P = 0.0215$. The relative error in the estimation of fish length is 5.4% and 4.5% in the estimation of the height (with the corresponding SDs being 0.049 and 0.062, respectively). Taking into consideration that the length of the fish recognized in the photographs of the dataset ranges from 10 cm to 30 cm, the average absolute error is in the order of 0.5 cm–1.5 cm.
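As a rough consistency check on these figures, and assuming the absolute error is simply the relative length error multiplied by the fish length $\ell$ (a simplification introduced here for illustration, not a formula taken from the experiments), the 5.4% length error gives:

$$e_{abs} \approx 0.054\,\ell: \qquad 0.054 \times 10\ \text{cm} = 0.54\ \text{cm}, \qquad 0.054 \times 30\ \text{cm} = 1.62\ \text{cm}$$

Both values agree with the quoted order of 0.5 cm–1.5 cm.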
More details concerning the accuracy in the alignment of individual landmarks can be found in Figure 12. Specifically, landmarks 7 (top of the caudal fin) and 9 (bottom of the caudal fin) are located with a mean error equal to 6.8% and 7.8%, respectively. These are the highest relative errors measured per landmark. They appear at the landmarks that mark the edge of the caudal fin because in most photographs of the UVIMEF dataset used for training and testing, the caudal fin perimeter is often indistinguishable from the background. Landmarks 1 (mouth) and 8 (middle of the caudal fin), which are used for the estimation of fish length, are located with a mean error of 6.8% and 5.6%, respectively. When they are combined to estimate the fish length, the average relative error is 5.4%.
Landmarks 3 and 13 are used to estimate fish height. Their average relative error is 4.8% and 4.6%, respectively, lower than the error shown by landmarks 1 and 8 that are used for length estimation. For this reason, the error in fish height estimation (4.5%) is lower than that of the fish length (5.4%). Other landmarks of interest are No. 17 and 18 that are used to locate the fish eye ROI. These landmarks are located with an average relative error of 4.5% and 3.6%, respectively. Taking into consideration the fish size range mentioned above, this relative error in the fish eye localization translates to about 0.4 cm–1.2 cm. In the experiments conducted, the fish eye was not always found between landmarks No. 17 and 18. Nevertheless, additional pattern recognition methods can be applied to localize the exact position of the eye in the neighborhood of these landmarks. Similarly, the position of the gills is another ROI located by landmarks No. 14, 15 and 16. The mean relative errors in the estimation of the position of these landmarks range between 3.8% and 4.7%.
In Figure 13, the relative error in the estimation of four morphological features, by the ERT models listed in Table 4, is displayed. More specifically, the error in the estimation of the fish length, height, as well as the position of the eyes and gills is compared. In all cases, the error of model M2 is slightly higher than that of model M3 and the error of M3 is slightly higher than the error of M4. However, the error shown by the default model M1 is higher than that of M3 in the estimation of the fish height and the position of the fish gills. Model M3 seems to show an error comparable to that of M1 and can replace it, if higher processing speed is required. If the frame-processing latency of the default model M1 ($N_t = 500$, $T_c = 10$) is $D_f$, the following equation estimates the latency $D'_f$ of a different model with $N'_t$ trees and $T'_c$ cascade stages:

$$D'_f = D_f \cdot \frac{N'_t \, T'_c}{500 \cdot 10}$$

The latency of M3 follows by substituting its Table 4 parameters in this equation. M4 is the model with the lowest latency. Thus, M4 is expected to have 56.25% higher speed than M1, which corresponds to $D'_f = D_f / 1.5625 = 0.64\,D_f$.
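The latency estimate can be reproduced numerically with the sketch below. It assumes that the latency scales linearly with the product of the number of regression trees and the number of cascade stages, relative to the default model M1 (500 trees, 10 stages). The example parameters (400 trees, 8 stages) are hypothetical: they are just one combination that reproduces the quoted 56.25% speed gain and are not taken from Table 4.

```python
def scaled_latency(base_latency, n_trees, n_stages,
                   base_trees=500, base_stages=10):
    """Latency of a reduced model, assuming it is proportional to trees x stages."""
    return base_latency * (n_trees * n_stages) / (base_trees * base_stages)


def speed_gain_percent(base_latency, new_latency):
    """How much faster the new model is, as a percentage of the base speed."""
    return (base_latency / new_latency - 1.0) * 100.0


if __name__ == "__main__":
    d_f = 1.0                                  # latency of M1 in arbitrary units
    d_new = scaled_latency(d_f, 400, 8)        # hypothetical reduced model
    print(d_new, speed_gain_percent(d_f, d_new))
```

Expressing the gain as a speed ratio rather than a latency reduction explains why a model with 64% of the latency is described as 56.25% faster.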
Concerning the fish orientation classification, Table 8 shows that the best results are achieved with the combination of PCA with TM when the false eye template matching estimations are ignored. PCA is capable of detecting the tilt of the fish with high accuracy. However, it could detect the direction of the fish in only 44.8% of the cases, using the low-contrast images of the UVIMEF dataset. The COD achieved a higher success rate (67.2%) but can detect only a draft left or right direction. The FOD method could be used to classify the direction of the fish in four quadrants but its classification accuracy is only 43.1%. On the other hand, fish eye template matching has a relatively higher success rate of 63.8%. This could have been even higher, if the resolution and the quality of the dataset images were better because in many fish image patches the eye is not visible at all. In these cases, the eye is confused either with background objects or with other parts of the fish like a strip in the caudal fin of some species such as Diplodus annularis (see Figure 7). Certain combinations of these orientation detection methods were also tested as shown in Table 8. Combining PCA with the left or right classification of COD achieved a success rate of 65.5%. The highest accuracy was achieved when the PCA was combined with TM and can reach 79.3%. A much higher orientation accuracy is expected to be achieved if the track of the fish is also taken into consideration as explained in Section 3.5.
Comparing the fish size estimation methods listed in Table 9 as well as the errors displayed in Figure 13, it is obvious that the proposed fish length or height estimation achieves one of the best accuracies reported for morphological feature estimation. In [26], a slightly lower error (5%) is achieved in the estimation of the fish length while in [32] the error is lower only in some specific fish sizes. However, in most of the cases presented in the literature, fish size is estimated in a controlled environment (e.g., on a conveyor belt) or with high-resolution underwater images and clearly visible fish as described in Figure 14 and Table 10. Estimating fish size in low-contrast, low-resolution images like those of the UVIMEF dataset is a much more challenging task. All the errors listed in Table 5, Table 6, Table 7 and Table 9, as well as Figure 12 and Figure 13, would be expected to be lower if the ERT models had been trained with higher-quality images. It is also worth noting from Table 9 that the accuracy in the present work is much better compared to previous work [10]. The frame-processing speed of the current approach is also orders of magnitude higher than that of the previous work [10].
In summary, the developed framework offers a number of useful services for fish monitoring such as morphological feature estimation, fish orientation classification and fish tracking. These services can be employed both for monitoring fish in free waters and aquacultures. The fish detection, orientation classification and shape alignment methods for morphological feature estimation were described in detail. The principles of fish tracking in the developed framework were also discussed.
One of the limitations of the current work is the latency of the fish detection. Specific directions were given to pipeline the fish detection in specific frames with other tasks that can run as parallel threads. These tasks can be the bounding box interpolation in intermediate frames between actual fish detections, the execution of the orientation classification and the shape alignment. Hardware acceleration of the shape alignment process was applied for embedded target platforms. Similarly, the inference for fish detection can also be implemented in hardware on the same target platform. Developing such an architecture is part of our on-going work in order to achieve a high frame-processing speed for the overall system and support real-time operation. Finally, more sophisticated techniques can also be incorporated into the presented fish-tracking approach. For example, feedback from the shape alignment stage can be exploited to identify with higher confidence the fish in successive frames without confusing their positions. The fish orientation can also indicate when the fish is changing direction in its track.