1. Introduction
Cancer Alley spans 85 miles in southeastern Louisiana, stretching from New Orleans to Baton Rouge along the Mississippi River, with a population of approximately 45,000 [1]. This region hosts around 150 plastic plants, chemical facilities, and oil refineries, and this number continues to grow despite the evident environmental impact. The air in Cancer Alley is characterized by toxic emissions and ranks among the most polluted in the United States [1]. Approximately 50 toxic chemicals, including benzene, formaldehyde, ethylene oxide, and chloroprene, contribute to air pollution, with chloroprene being particularly concerning. The pollution in Cancer Alley has severe consequences for residents, many of whom eventually require nebulizers for survival. The recent coronavirus pandemic further exacerbated their plight because of their compromised health. Despite efforts by the Environmental Protection Agency (EPA) to regulate industries in the area and enhance the living standards of its residents, individuals in this region still face a 95% higher risk of cancer from air pollution compared with the rest of America [2,3]. The potential for catastrophic damage to land, marine, and coastal ecosystems underscores the importance of early detection and cleaning of oil and chemical spills in Cancer Alley to minimize environmental harm [4,5].
Chemical spill incidents exhibit a distinctive appearance in satellite images generated by Synthetic Aperture Radar (SAR) technology. This distinctiveness arises because spills dampen short gravity waves on the water surface, reducing radar backscatter intensity and creating unique dark formations in SAR images [6,7,8]. Exploiting this characteristic enables the segmentation of the resulting SAR images and facilitates the training of a neural network model on the acquired data. The field of segmentation, as outlined in the literature [9], encompasses various types, including foreground segmentation, panoptic segmentation, semantic segmentation, and instance segmentation. Real-time segmentation primarily aims to predict masks over objects within an image frame with low latency.
Image segmentation is the process of dividing an image into distinct segments, thus enhancing the ease of analysis and comprehension [9]. This technique finds applications in critical fields such as healthcare, transportation, and pattern recognition. Image segmentation algorithms fall into categories such as threshold-based, graph-based, morphological-based, edge-based, clustering-based, Bayesian-based, and neural network-based segmentation. Each of these algorithms comes with its own advantages and disadvantages, tailored to specific applications. Numerous studies have explored the topic, including the recent publication by Kirillov et al. [9], which introduces the "segment anything" framework by Meta AI. Their study implements a prompt-based segmentation tool trained on the most extensive segmentation dataset to date, utilizing 256 GPUs. The SA-1B dataset, created using Meta's custom data engine, comprises 1 billion masks and 11 million images collected from various countries and continents worldwide. Models trained on the SA-1B dataset demonstrate a capacity to generalize across a wide range of data. However, the model, a transformer, has notable drawbacks: it requires a substantial amount of energy to train, and the study reveals that training accuracy improves with larger datasets, indicating a need for still more data to train more precise models [10]. In another study, Ronneberger et al. [11] developed the U-NET, a specialized neural network for image segmentation tasks. Their training strategy relies on data augmentation, enabling effective training on far fewer images than the previous study [10]. The U-NET achieves accuracies of approximately 92% and 77.56% when trained on the PhC-U373 and DIC-HeLa datasets, respectively, for image segmentation tasks. Subsequently, Oktay et al. introduced the Attention U-Net [12], a more energy-efficient adaptation of the U-NET. This model efficiently learns to focus on target structures of varying shapes and sizes within the dataset, maintaining prediction accuracy without significant energy costs. In addition to these segmentation-focused studies, the work documented in [13] develops neural network models specifically for monitoring oil spills, while [14,15] develop neural network models for healthcare-related applications.
In this research, we employed neural network models based on the U-NET architecture for image segmentation, specifically targeting chemical and oil spills. Our model was trained using the CSIRO dataset and the Oil Spill Detection dataset. Furthermore, we introduced mixed precision to streamline the model training process, optimizing data throughput on both the CPU and GPU platforms. As an additional acceleration strategy, we advocate the adoption of FPGA architectures, leveraging frameworks like the Xilinx FINN framework [16,17] and HLS4ML [18] to synthesize bitstreams for machine learning models quickly. The structure of this manuscript unfolds as follows: Section 2 provides an expansive description of the segmentation approach employed in this study. Section 3 delves into the intricacies of neural networks, while Section 4 elucidates the concept of mixed precision. Section 5 describes the methodology applied in training the neural network models under investigation. Section 6 presents the preliminary simulation results on both CPU (central processing unit) and GPU (graphics processing unit). Following that, Section 7 outlines our FPGA accelerator architectures and presents the corresponding simulation results on the FPGA platform. Section 8 discusses the results and challenges of this study. Finally, Section 9 concludes the study.
2. Segmentation of Chemical Spills
Segmentation is a computer vision task that classifies each pixel within an image into one of a set of classes. Because the image segmentation task involves extensive pixel-based processing, a thorough understanding of the data is necessary before selecting an appropriate model. Before the model training phase, we therefore conducted a detailed examination of the images to enhance our comprehension of the datasets used in this study.
Figure 1 illustrates the color distribution of a randomly selected open-source RGB image sample and a sample from the Oil Spill dataset across both the RGB and HSV color spaces. The diagram showcases the color distribution of our data sample compared to a typical RGB image. As shown in Figure 1, color spaces offer insight into the dispersion or concentration of content across the color channels of our images. Leveraging this understanding, we can determine the most suitable technique for segmenting different components of the image. In Figure 1, the distribution of samples from the Oil Spill dataset follows a linear pattern, differing from that of the goldfish image. However, it displays varying color intensities, offering guidance on the optimal approach for crafting a segmentation model to capture the various segments of the images.
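As a minimal sketch of this kind of color-space inspection, the snippet below computes per-channel histograms with OpenCV; the sample path and bin choices are illustrative assumptions, not our exact analysis pipeline:

```python
import cv2
import numpy as np

# Load a sample image (path is hypothetical) and convert BGR -> HSV.
img = cv2.imread("sample.png")                 # OpenCV loads images as BGR
rgb = np.ascontiguousarray(img[..., ::-1])     # reorder channels to RGB
hsv = cv2.cvtColor(img, cv2.COLOR_BGR2HSV)     # OpenCV hue range is [0, 180)

# Per-channel histograms show whether content is dispersed or concentrated
# across the color channels, as discussed for Figure 1.
for name, data, ranges in (("RGB", rgb, (256, 256, 256)),
                           ("HSV", hsv, (180, 256, 256))):
    for ch, bins in enumerate(ranges):
        hist = cv2.calcHist([data], [ch], None, [bins], [0, bins])
        print(f"{name}[{ch}]: peak bin {int(np.argmax(hist))}, "
              f"spread (std) {hist.std():.1f}")
```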
Semantic segmentation and instance segmentation stand out as the two predominant forms of segmentation used today. In instance segmentation [9], individual objects are identified and segmented within an image, with each instance assigned a unique label or color. Semantic segmentation, on the other hand, categorizes each pixel in an image into one of several predefined classes, where objects belonging to the same class share the same label or color. This contrasts with instance segmentation, which treats each object instance as a separate entity within the image. For this study, semantic segmentation is employed.
3. Neural Network
In this research, we employed neural networks based on the U-NET architecture for the segmentation task. The U-NET architecture, as described in previous works [11,12], utilizes techniques such as data augmentation, convolution, pooling, upscaling, and downscaling to achieve its distinctive U-shaped network structure. Because of its ability to attain high training accuracy in a shorter time, the U-NET is well-suited for large-scale oil and chemical spill detection, offering a more power-efficient training process compared with transformer models.
Presently, various neural network models are utilized in segmentation tasks, including the robust SegmentAnything transformer model by Meta AI, the Vanilla U-NET model, the Attention U-NET model, and others. However, transformer models are less power-efficient for this specific task, requiring extensive training on large datasets to match the accuracy of the U-NET, which reaches satisfactory results with minimal training. The U-NET implementation in this study is tailored to accommodate the distinct datasets, adapting to variations in image sizes across the datasets.
Architecture
The U-NET architecture adopted in this study closely resembles the configuration described in the previous work [11]. However, unlike the previous work [11], our study employs oil spill datasets for training the model. Illustrated in Figure 2, the U-NET consists of both a contracting and an expansive path. The contracting path iteratively applies convolution, followed by rectified linear unit (ReLU) activation and max-pooling operations. In our architecture, the convolution layers extract features in the form of feature maps, which are subsequently propagated down the network.
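Purely for illustration, here is a minimal PyTorch sketch of one contracting-path stage (two convolutions with ReLU, followed by max pooling); the layer widths are placeholders, not the exact configuration used in this study:

```python
import torch
import torch.nn as nn

class DownBlock(nn.Module):
    """One contracting-path stage: (conv -> ReLU) x 2, then 2x2 max pooling.
    The pre-pooling feature map is kept for the skip connection."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
        )
        self.pool = nn.MaxPool2d(2)

    def forward(self, x):
        skip = self.conv(x)          # feature map forwarded via skip connection
        return self.pool(skip), skip

# Example: the first two stages of the contracting path.
x = torch.randn(1, 3, 256, 256)
down1 = DownBlock(3, 64)
down2 = DownBlock(64, 128)
x, skip1 = down1(x)                  # x: (1, 64, 128, 128)
x, skip2 = down2(x)                  # x: (1, 128, 64, 64)
```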
The transformer variant of the U-NET is named UNETR [19]. In UNETR, the downscaling (encoding) portion of the network is replaced with a transformer encoder, while the upscaling (decoding) portion retains the U-shape, as shown in Figure 3. The UNETR transformer encoder is connected directly to the decoder via skip connections, instead of an attention layer, at different resolutions to compute the final three-dimensional (3D) semantic segmentation output. Skip connections, just as in the U-NET model, help the network preserve information about features from the original input at each convolution level. Unlike convolutional neural networks (CNNs), with their local modeling capacities, transformers encode images as a sequence of 1D patch embeddings and utilize self-attention modules to learn the weighted sum of values calculated in the hidden layers.
The encoder shown in Figure 3 below and in Figure 1 of [19] comprises a positional encoding layer, a stack of encoding layers, and the causal attention and feed-forward layers. The encoder reads input signals and generates representations of the input data once it has learned the sequence representations of the input volume and effectively captured the global multi-scale information. The decoder, in turn, generates the output token by token from the representations, in the form of tokenized patches, produced by the encoder. The vanilla decoder model comprises a stack of decoder layers and a positional encoder. The decoder layers contain global self-attention/cross-attention and feed-forward layers. Together, the encoder and decoder form the transformer model. In this work, we replace the decoder section with a U-NET decoder.
Transformers used for image recognition tasks are commonly called vision transformers. Therefore, our UNETR architecture in Figure 3 can be more precisely referred to as a vision UNETR model. Figure 4 shows how images are encoded by vision transformers [20] for image-related classification tasks. The input image is split into patches that constitute a linear sequence of tokens, similar to words in the case of the Bidirectional Encoder Representations from Transformers (BERT) model [21]. The Multiheaded Self-Attention (MSA) block computes self-attention on each head and finally concatenates the results, as shown in (1) to (13) below. The computation on each head can be parallelized. By observing the data structure of the transformer, we arrived at our design of a hardware accelerator.
In the equations above, X in (1) represents the input, R denotes the real numbers, and H, W, and D denote the height, width, and depth of our image frames, while C represents the number of input image channels. Xv in (2) represents the flattened, uniform, non-overlapping patch version of X. P in (2) denotes the resolution of each patch, and N is the length of the sequence, estimated using (3). Epos in (4) denotes the learnable positional embedding, while E in (5) represents the projected patch embedding. K in (4) and (5) denotes the size/dimension of the embedding space. Z in (6)–(8) represents the output sequence, together with the query (q) and the corresponding key (k) and value (v) pairs. A in (9) represents the attention weights/scores, and k represents the key. Kh in (10) is the scaling factor used to keep the number of parameters constant across different key values. In (11), v denotes the values of the input sequence and is used to calculate the self-attention SA over sequence z. Multiheaded self-attention (MSA) is given by (12), and Wmsa in (13) denotes the multiheaded trainable parameter weights.
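To make (2), (4), and (12) concrete, here is a minimal, purely illustrative 2D sketch in PyTorch; UNETR itself operates on 3D volumes, and all sizes below are placeholder assumptions rather than our trained configuration:

```python
import torch
import torch.nn as nn

B, C, H, W = 1, 3, 224, 224          # batch, channels, height, width (2D case)
P, K, heads = 16, 768, 12            # patch size, embedding dim, attention heads
N = (H // P) * (W // P)              # sequence length, cf. (3) in 2D form

img = torch.randn(B, C, H, W)

# (2): flatten non-overlapping P x P patches into a token sequence.
patches = img.unfold(2, P, P).unfold(3, P, P)            # (B, C, H/P, W/P, P, P)
patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(B, N, C * P * P)

# (4)-(5): project patches and add a learnable positional embedding.
E = nn.Linear(C * P * P, K)
E_pos = nn.Parameter(torch.zeros(B, N, K))
z0 = E(patches) + E_pos                                  # (B, N, K)

# (6)-(13): one multiheaded self-attention step; each head can be computed
# in parallel, which is the property our hardware accelerator design exploits.
msa = nn.MultiheadAttention(embed_dim=K, num_heads=heads, batch_first=True)
out, attn = msa(z0, z0, z0)                              # out: (B, N, K)
```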
4. Mixed Precision Architecture
The mixed precision architecture is an optimization technique harnessing the computational power of GPU cores, resulting in 2 to 4 times faster computation and a 50% reduction in memory usage. This approach creates a potent compute engine without necessitating alterations to the hardware architecture [22]. Specifically, Volta cores in NVIDIA GPUs, with a data throughput of 123 teraflops, experience significant benefits from this architecture [22]. By employing 16-bit precision instead of 32-bit precision, computing throughput on Volta cores can be enhanced by a factor of 8, memory throughput can be doubled, and the data unit input size can be halved [22].
In our implementation, we opted for mixed precision over a constant 16-bit precision to address potential imprecision in weight updates associated with FP16. This precision choice is crucial, as cumulative errors could significantly impact the final predictions. Mixed precision allows us to achieve nearly the same training and prediction accuracy as FP32 without altering hyperparameters. NVIDIA libraries, optimized for tensor cores, derive significant advantages from this architecture [22].
Major machine learning frameworks like PyTorch and TensorFlow have seamlessly integrated the mixed precision feature, facilitating the implementation of automatic mixed precision with just a few lines of code, as illustrated in Figure 5 and Figure 6. For further customization, the mixed precision method can be manually added to different sections or lines of code.
In our framework, we utilized the APEX AMP (automatic mixed precision) PyTorch extension to implement mixed precision seamlessly, with minimal code, on an NVIDIA A100 GPU.
Figure 6 illustrates the concept of mixed precision, where values are cast between FP16 and FP32 to preserve accuracy. A scale factor of 128, commonly used for loss scaling, is employed to preserve small gradient values and accuracy; it serves as a constant in our study.
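A minimal sketch of this setup with APEX AMP follows; the tiny stand-in model, data, and learning rate are placeholders (in the study this would be the U-NET and the spill datasets), while the static loss scale of 128 matches the constant described above:

```python
import torch
import torch.nn as nn
from apex import amp  # NVIDIA APEX extension; assumes APEX is installed

# Tiny stand-in model and data for illustration only.
model = nn.Conv2d(3, 5, kernel_size=3, padding=1).cuda()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss()

# O1 patches eligible ops to run in FP16 while keeping FP32 master weights.
# loss_scale=128.0 applies the static scale factor described above
# (pass "dynamic" for dynamic loss scaling instead).
model, optimizer = amp.initialize(model, optimizer,
                                  opt_level="O1", loss_scale=128.0)

images = torch.randn(2, 3, 64, 64).cuda()
masks = torch.randint(0, 5, (2, 64, 64)).cuda()

optimizer.zero_grad()
loss = criterion(model(images), masks)
# Scale the loss so small FP16 gradients do not underflow in backward().
with amp.scale_loss(loss, optimizer) as scaled_loss:
    scaled_loss.backward()
optimizer.step()
```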
7. FPGA Accelerator
Integrating an FPGA accelerator serves two purposes: power optimization and improved streaming efficiency. The SegmentAnything model, for instance, employs 256 GPUs during training, resulting in significant power consumption. In this study, we save our model in ONNX format for compatibility with FINN. The FINN framework [16] incorporates the Brevitas library, allowing the generation of FPGA accelerators from pre-trained models. ONNX, an open-source format, is employed to represent machine learning models. The FINN framework takes the ONNX file and generates an FPGA model for each layer of the network, establishing communication between layers through AXI streams.
The Brevitas framework [34], which works with the FINN builder, is used in the development of an FPGA accelerator for our model. Brevitas is a PyTorch library for neural network quantization, with support for both post-training quantization (PTQ) and quantization-aware training (QAT) [34]. It offers quantized implementations of the most common PyTorch layers used in deep neural networks (DNNs) under brevitas.nn, including QuantConv1d, QuantConv2d, QuantConvTranspose1d, QuantConvTranspose2d, QuantMultiheadAttention, QuantRNN, QuantLSTM, and several others. For each of these layers, the quantization of the different tensors (input, weight, bias, outputs, and other factors) can be individually tuned over a wide range of quantization settings [34]. Brevitas thus enables fine-grained quantization-aware training [16].
Another tool used to generate our accelerator is ONNX Runtime [35], which is used for integration with standard ONNX-based toolchains. Open Neural Network eXchange (ONNX) is an open standard format for representing machine learning models. ONNX export support is built into PyTorch as torch.onnx; this module captures the computation graph from a native PyTorch torch.nn.Module and converts it into an ONNX graph, which can be exported and consumed by the many runtimes that support ONNX. The Hugging Face transformers library likewise provides a transformers.onnx package for converting transformer models to the ONNX format. The ONNX standard supports quantization down to 8 bits, but a variant named Quantized ONNX (QONNX) supports expressing quantization down to 1 bit for both weights and activations [16].
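A minimal sketch of this export step is shown below with a tiny stand-in module and a hypothetical output path; note that Brevitas additionally provides its own QONNX export utilities for quantized layers, so plain torch.onnx.export is shown here only to illustrate the mechanism:

```python
import torch
import torch.nn as nn

# Any torch.nn.Module can be exported; a tiny stand-in model is used here.
model = nn.Conv2d(3, 64, kernel_size=3, padding=1)

# Trace the module with a dummy input of the expected shape and write the
# captured computation graph to an ONNX file (path is hypothetical).
dummy = torch.randn(1, 3, 64, 64)
torch.onnx.export(model, dummy, "unet_block.onnx",
                  input_names=["image"], output_names=["features"],
                  opset_version=13)
```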
Finally, the FINN framework [16] is a quantization-aware framework used for the generation of custom FPGA dataflow accelerators, or register-transfer level (RTL) models. It is designed to work with the ONNX model format; FINN uses ONNX as its intermediate representation for neural networks, and as such almost every FINN component uses ONNX and its Python API. FINN supports two specialized variants of ONNX, namely, QONNX and FINN-ONNX. FINN also provides a ModelWrapper class, a thin wrapper around the ONNX model that makes it easier to analyze and manipulate ONNX graphs. This wrapper provides many helper functions while still giving full access to the ONNX protobuf representation.
FINN supports three settings of the mem_mode attribute for the MatrixVectorActivation node [16]. This attribute controls how the weight values are accessed during the execution phase and has a direct influence on the resulting circuit. The three mem_mode settings supported in FINN are "const", "decoupled", and "external", each with its own advantages and disadvantages.
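As a hedged sketch of how ModelWrapper and mem_mode fit together (module paths follow recent FINN/QONNX releases and may differ by version; the file name is hypothetical, and MatrixVectorActivation nodes only appear after FINN's transformations have lowered the graph to dataflow layers):

```python
from qonnx.core.modelwrapper import ModelWrapper
from qonnx.custom_op.registry import getCustomOp

# Load an ONNX graph through FINN's thin wrapper.
model = ModelWrapper("unet_block.onnx")

# Set the weight-access mode on each MatrixVectorActivation node.
for node in model.graph.node:
    if node.op_type == "MatrixVectorActivation":
        inst = getCustomOp(node)
        inst.set_nodeattr("mem_mode", "decoupled")  # or "const" / "external"

model.save("unet_block_decoupled.onnx")
```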
Figure 11 shows the design flow employed in developing our accelerator.
A significant challenge encountered during our model development was ensuring the proper functioning of the software stack. We attempted to utilize HLS4ML [18], which is primarily designed for Keras, as an alternative to FINN but faced similar compatibility issues. Both frameworks exhibited instability during accelerator development. However, given their ongoing development, they hold promise for significantly enhancing the speed of FPGA bitstream development for machine learning models in the future. To overcome the hurdles associated with developing and verifying bitstreams using FINN and HLS4ML, we opted to design accelerators for the UNET and UNETR models from scratch using High-Level Synthesis (HLS). The resulting accelerators are depicted in Figure 12 and Figure 13.
FPGA Inference Results
We generated an FPGA design for our model via HLS and verified the design on the Pynq Z1 board. The resource usage of our FPGA design indicates low power consumption on the Pynq Z1 board [36]. Table 2 shows the resource usage profiles of our UNET and UNETR models, as well as the inference latencies achieved.
8. Discussion
The semantic segmentation technique utilized in this study assigns a class label to each pixel in the image samples from our dataset. Figure 8 illustrates the names of the various classes present in our dataset, while Figure 14 reveals that the ocean (background) constitutes most of the samples. This distribution suggests that our models are more likely to predict class 0 (ocean) accurately because of its predominance among the samples compared with the other classes.
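For reference, a minimal sketch of how such a class distribution can be checked, assuming integer-labeled mask arrays (the names, shapes, and placeholder data are illustrative only):

```python
import numpy as np

# Stack of integer-labeled ground truth masks: (num_samples, H, W),
# with class 0 = ocean/background. Placeholder data stands in for real masks.
masks = np.random.randint(0, 5, size=(100, 256, 256))

classes, counts = np.unique(masks, return_counts=True)
for c, n in zip(classes, counts):
    print(f"class {c}: {n} pixels ({100.0 * n / masks.size:.1f}%)")
```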
In Figure 15, we evaluate our classification model's performance using a confusion matrix, specifically focusing on the UNET model. The results indicate that all classes perform well except for class 3 (ship). The UNET model struggles to distinguish between the ocean (class 0) and ships (class 3), frequently misclassifying ships as the ocean. This issue can be attributed to the fact that class 3 has the fewest samples (22,981) in the dataset, as shown in Figure 14, which may be insufficient for the model to generalize effectively with only 50 training epochs.
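A per-pixel confusion matrix of this kind can be computed as sketched below, assuming flattened per-pixel labels and scikit-learn; the variable names and placeholder data are illustrative only:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# y_true / y_pred: integer class labels per pixel, flattened across the
# whole test set. Placeholder data stands in for real masks/predictions.
y_true = np.random.randint(0, 5, size=100_000)
y_pred = np.random.randint(0, 5, size=100_000)

cm = confusion_matrix(y_true, y_pred, labels=range(5))
# Row-normalize so each row shows how a true class is distributed across
# predicted classes (e.g., how often ships are misclassified as ocean).
cm_norm = cm / cm.sum(axis=1, keepdims=True)
print(np.round(cm_norm, 2))
```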
Figure 16 illustrates the differences between the test mask and the predicted mask after training the model for 50 epochs. Finally, we performed inference on the FPGA and display the results in Figure 17. In the real world, the impacts of chemical spills and contamination are prevalent not only in Cancer Alley but also in other, less-developed parts of the world. The results of this study therefore have far-reaching implications, reducing the cost of monitoring contamination and of effectively detecting chemical spills.
To compare the performance of our model, we examined related studies that applied neural network models to image segmentation of an oil spill dataset or a related dataset, as shown in Table 3. C. Li et al. [37] perform image segmentation using a dual-stream U-NET (DS-UNET) on two datasets, namely, the PALSAR and Sentinel datasets. Their study further measures model performance according to three metrics: the dice similarity coefficient (DSC), the average Hausdorff distance (HD), and the F1 score. Another study, by Maria Anto et al. [38], uses a convolutional neural network (CNN) for oil spill detection and achieves 85% testing accuracy. The study by J. Fan and C. Liu [39] addresses two problems: the scarcity of sufficient oil spill data and the difficulty of detecting oil spills in an environment containing oil spill look-alikes. Their study [39] uses multitask generative adversarial networks (MTGANs) to detect and semantically segment oil spill data, applying the model to three datasets, namely, the Sentinel-1, ERS-1/2, and GF-3 satellite datasets. In [40], X. Kang et al. use a self-supervised spectral–spatial transformer network (SSTNet) for feature extraction on custom hyperspectral oil spill database (HOSD) data. The training technique applied in this method required a large number of training epochs to achieve a model that generalizes with high accuracy. Another study, by J. Fan et al. [41], built a framework using a multi-feature semantic complementation network (MFSCNet) for oil spill localization and segmentation of SAR images obtained from Sentinel-1 satellite data. The study by Mahmoud, A.S. et al. [42] applies a novel deep learning UNET model based on the Dual Attention Model (DAM). This model, named DAM-UNet, integrates a dual attention model to selectively highlight the relevant and discriminative global and local characteristics of oil spills in SAR images, using a channel attention map and a position attention map. Finally, Dong et al. [43] propose three deep learning-based marine oil spill detection methods: a direct detection method based on the transformer and UNet, a detection method based on the Fast and Flexible Denoising CNN (FFDNet) and TransUNet with denoising before detection, and a detection method based on integrated multi-model learning. The performance benefits of the proposed methods are verified by comparison with semantic segmentation models such as UNet, SegNet, and DeepLabV3+. Compared with our work, most of these approaches require more training to reach comparable accuracy, as shown in Table 3.
Apart from FPGAs, other hardware alternatives for inference include various Application-Specific Integrated Circuits (ASICs) and, for event-based datasets, neuromorphic hardware. Since this study focuses on CPUs, GPUs, and FPGAs, Table 4 compares results from related FPGA implementations for image segmentation using UNET or other related networks.
When performing machine learning inference on images using specialized non-reconfigurable hardware, latency and throughput can present significant challenges. FPGAs address these issues effectively because of their ability to be reconfigured and programmed with different architectures, thereby enhancing inference performance without requiring new hardware purchases. Additionally, FPGAs consume less power compared with GPUs and CPUs. These benefits make FPGAs the preferred choice for resource-intensive tasks like image segmentation, as demonstrated in this study.
9. Conclusions
In this study, we utilized the UNET and UNETR neural network architectures to perform semantic segmentation of chemical spills, leveraging two distinct datasets. This work has profound real-world applications, as it can be challenging in the field to detect oil spills and separate look-alikes from actual spills. A notable aspect of our work is the development of reusable labeled ground truth images specifically tailored for the CSIRO dataset, a task previously unexplored. Our implementation integrates mixed precision techniques to enhance computational efficiency across both CPU and GPU platforms. Furthermore, we engineered an FPGA accelerator for the neural networks using High-Level Synthesis (HLS). Despite initial setbacks with tools like FINN and HLS4ML, we successfully devised a custom FPGA implementation using Vivado HLS. Our findings reveal a significant discrepancy in resource utilization between the UNETR and UNET models, primarily because of their divergent sizes. Consequently, implementing UNETR requires targeting alternative Pynq-compatible FPGA boards with ample LUT and DSP resources, such as the ZCU102 and Alveo boards. Ultimately, our experiments demonstrate that the UNET model surpasses the UNETR model in prediction accuracy on both CPU and GPU platforms. Moreover, owing to its more efficient resource utilization, the UNET model emerges as the preferred choice for this task. Finally, the results obtained from our study demonstrate improvements in inference latency on FPGA, with ~94% prediction accuracy using UNET and ~77% prediction accuracy using UNETR.