1. Introduction
As earth observation satellite technologies develop rapidly [1], increasingly advanced sensors can acquire large volumes of remote sensing images at high rates and with higher spatial, temporal, and spectral resolution [2,3,4,5]. This leads to an exponential increase in the size of the acquired images, which greatly enriches the available imagery while also posing many challenges for storing and processing the images efficiently [6,7,8]. On the one hand, important and meaningful scientific discoveries may result from analyzing very large volumes of remote sensing images on a global scale and over long time series [9,10,11], so mining useful knowledge from the archived images efficiently becomes an intractable task. On the other hand, in practical applications, the demands of real-time large-scale image processing are also increasing [12], especially in flood or earthquake disasters, which require large-scale images to be processed as soon as possible [13,14,15]. Therefore, how best to process the acquired large-scale images in real time is becoming more urgent, yet it remains difficult [16].
To process large-scale images rapidly and efficiently in specific applications, the processing tasks should be carried out over a distributed cluster organized by big data frameworks [17] rather than on multiple independent single machines. This is because a big data framework can pool the computing resources of many single machines and hence allows the cumulative resources to be used as if they were a single computer [18]. Hence, besides a group of physical machines, a framework coordinating the tasks across multiple calculation nodes is necessary. Google Earth Engine (GEE) [19] may be regarded as one such framework, designed specifically to process large-scale remote sensing images on a cloud platform [20,21]. In GEE, massive images are uniformly organized and processed under predefined norms [22], making it convenient for users to work with archived datasets without downloading or processing images locally [23]. However, GEE lacks support for several specific remote sensing datasets, and its official image processing algorithms are limited with respect to object-based image analysis [24], as it focuses more on pixel-based image processing algorithms [25]. Hence, an alternative way to process large-scale images is to adopt general-purpose big data frameworks, such as Hadoop [26], Spark [27], and Flink [28], to manage and coordinate the tasks over a distributed cluster. With the help of these mature frameworks, functions can be developed more flexibly and customized to various actual demands over a distributed cluster. Hence, transplanting the abundant sequential algorithms to a distributed environment becomes a practical issue and a feasible scheme for expanding the capability of existing algorithms.
It remains a challenge, however, to transplant sequential algorithms directly to a distributed environment to deal with large-scale images. Because multiple calculation nodes can process subsets of a large-scale image simultaneously, the execution time scales approximately linearly with image volume in a distributed environment. Therefore, sequential algorithms integrated into a distributed cluster can greatly improve execution efficiency and scalability [29,30] while acquiring identical or similar results to their sequential implementations. However, sequential algorithms usually need to be reorganized and rebuilt to adapt to a distributed implementation environment. In addition, incomplete objects usually appear in the border areas of image tiles when object-based image analysis methods are executed in parallel across multiple calculation nodes: the decomposition of a large-scale image breaks up objects located on tile borders into several parts, which generates incomplete objects with artifacts [31]. Moreover, these incomplete objects may propagate inaccuracies that affect subsequent applications and may limit scalability. Hence, how best to solve the incomplete object problem efficiently is the key to transplanting sequential algorithms to a distributed environment.
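To make the border problem concrete, the following minimal Python sketch (the function name and tile layout are illustrative, not the decomposition used later in this paper) splits an image into non-overlapping tiles; any object whose pixels straddle a tile boundary is cut into pieces that are then segmented separately.

```python
import numpy as np

def split_into_tiles(image: np.ndarray, tile_size: int) -> dict:
    """Decompose a 2-D image into non-overlapping square tiles.

    An object lying across a tile boundary is cut into pieces,
    which is the incomplete-object problem described above.
    """
    h, w = image.shape[:2]
    tiles = {}
    for i in range(0, h, tile_size):
        for j in range(0, w, tile_size):
            # Key each tile by its (row, column) position in the tile grid.
            tiles[(i // tile_size, j // tile_size)] = \
                image[i:i + tile_size, j:j + tile_size]
    return tiles
```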
To solve the incomplete object problem, researchers have developed many models based on redundant computing that modify incomplete objects during large-scale image processing. A scalable tile-based approach has been exploited to process large-scale images and modify incomplete objects: Michel, Youssefi [32] and Lassalle, Inglada [33] separately presented mean-shift and region-merging methods built on this approach. Furthermore, Derksen, Inglada [34] proposed a tile-based simple linear iterative clustering (SLIC) segmentation algorithm to organize pixels into superpixels in parallel. These parallel segmentation methods were built on the concept of marginal stability to generate results identical or similar to the sequential ones. However, the marginally stable concept is not suitable for all segmentation algorithms, and the degree of parallelism of the parallel SLIC is not high enough. Besides these models, Lin and Li [35] and Gu, Han [36] developed parallel segmentation algorithms based on a minimum spanning tree model and a minimum heterogeneity rule partitioning technique to modify incomplete objects, but the message passing interface parallel technology is rather complicated for many researchers, as users must concentrate not only on the target algorithm itself but also on task scheduling among the calculation nodes. To allow more effort to be focused on algorithm modifications, Happ, da Costa [37] proposed a Hadoop-based distributed strategy for a region-growing algorithm that modifies incomplete objects with a specific indexing mechanism and a hierarchical stitching method, but its MapReduce programming model is relatively slow. To accelerate the calculations, Wang, Chen [38] proposed a distributed image segmentation model that removes artifacts and modifies incomplete objects by repeated calculations on the faster Apache Spark platform; however, the execution time was not recorded, and the communication volume of Apache Spark was not well optimized. To deal with these issues, Chen, Wang [39] improved the distributed segmentation model in terms of communication reduction by decreasing the auxiliary bands and detecting the buffer sizes automatically. However, the remaining auxiliary bands not only waste calculation memory but also generate large communication volumes during the shuffle stage, limiting the efficiency of the distributed strategy. In addition, most of the described strategies modify incomplete objects through redundant computing. Hence, it is necessary to remove the extra auxiliary bands, optimize the data volume during the shuffle stage, and reduce the redundant computing to improve the efficiency further.
In this study, to further improve the efficiency of previous strategies, a new distributed strategy based on Apache Spark is proposed to solve the incomplete object issue that arises when integrating the sequential SLIC algorithm into a distributed cluster. The image tiles, generated by loading, decomposing, and distributing a large-scale image, are divided into two categories according to their indexes: even tiles and odd tiles. First, the even tiles are segmented independently and simultaneously by the SLIC algorithm. Then, the cluster centers and the buffer sizes of the four sides (left, right, bottom, and top) of the even tiles are acquired and encapsulated into an accumulator structure, without employing extra auxiliary bands to record intermediate results. The buffer sizes and cluster centers in the accumulator are passed from the even tiles to the odd tiles and broadcast around the distributed cluster. During the shuffle stage, each odd tile acquires pixels from the surrounding even tiles according to the buffer size on each side. The SLIC algorithm then segments the enlarged odd tiles with the support of the buffered pixels and cluster centers. In this way, all pixels of the large-scale image are segmented only once, without redundant computing, to modify the incomplete objects. After segmentation, the final result is generated by an ingestion operation. The proposed strategy was evaluated in terms of accuracy and execution efficiency to reveal its performance.
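A condensed PySpark sketch of this workflow is given below; it is a schematic under stated assumptions rather than the published implementation. Here decompose, slic_segment, extract_priors, emit_border_strips, merge_strips, and slic_segment_with_prior are hypothetical placeholders, and the prior information is collected on the driver before broadcasting (the accumulator-based variant is sketched after the objective list below).

```python
from pyspark import SparkContext

sc = SparkContext(appName="two-phase-slic-sketch")

# tiles: RDD of ((row, col), tile_array); decompose() is a hypothetical
# loader that splits the large-scale image into keyed tiles.
tiles = decompose(sc, "large_image.tif")

def is_even(key):
    row, col = key
    return (row + col) % 2 == 0

even_tiles = tiles.filter(lambda kv: is_even(kv[0]))
odd_tiles = tiles.filter(lambda kv: not is_even(kv[0]))

# Phase 1: segment even tiles; extract their cluster centers and per-side
# buffer sizes as prior information (no auxiliary bands are written).
even_results = even_tiles.mapValues(slic_segment).cache()
priors = even_results.mapValues(extract_priors).collectAsMap()
priors_bc = sc.broadcast(priors)  # ship the priors to every executor

# Phase 2 (shuffle): each even tile emits the border strips its odd
# neighbours need, keyed by the odd tile's grid index, so only about
# half of the tiles take part in the shuffle.
strips = even_results.flatMap(
    lambda kv: emit_border_strips(kv, priors_bc.value))
enlarged_odd = odd_tiles.union(strips).groupByKey().mapValues(merge_strips)

# Segment the enlarged odd tiles with the broadcast cluster centers so
# that every pixel of the image is segmented exactly once.
odd_results = enlarged_odd.mapValues(
    lambda t: slic_segment_with_prior(t, priors_bc.value))

# Ingest both halves back to the master node as the final result.
final = even_results.union(odd_results).collect()
```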
The main objectives of the research were to:
Modify incomplete objects by retaining the superpixels generated in even tiles and enlarging them into odd tiles with shared cluster centers;
Employ the shared variables of Apache Spark to record intermediate results rather than introducing extra auxiliary bands (see the sketch after this list);
Improve the efficiency by implementing shuffle operations among only approximately half of all image tiles.
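A minimal sketch of the second objective, recording intermediate results with Spark shared variables instead of auxiliary bands, is given below. It continues from the sketch above (sc and even_tiles); segment_tile is a hypothetical SLIC call returning the label image together with the cluster centers and buffer sizes, and the custom accumulator type is an assumption about how per-tile priors might be merged.

```python
from pyspark.accumulators import AccumulatorParam

class DictAccumulatorParam(AccumulatorParam):
    """Merge per-tile priors (cluster centers, buffer sizes) into one dict."""
    def zero(self, initial):
        return {}
    def addInPlace(self, acc, other):
        acc.update(other)
        return acc

priors_acc = sc.accumulator({}, DictAccumulatorParam())

def segment_and_record(kv):
    key, tile = kv
    labels, centers, buffers = segment_tile(tile)  # hypothetical SLIC call
    priors_acc.add({key: (centers, buffers)})      # no auxiliary bands needed
    return key, labels

even_results = even_tiles.map(segment_and_record).cache()
# An action must run before the accumulator value is complete; note that
# updates made inside transformations may repeat if a task is retried.
even_results.count()
priors_bc = sc.broadcast(priors_acc.value)
```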
The organization of this paper is as follows. Section 2 describes the methods, Section 3 outlines the experimental design, and Section 4 presents the results, followed by the discussion in Section 5. Concluding remarks are given in the final section.
5. Discussion
As shown in Figure 5 and Figure 6, the F-measure values of S0 and S1 were very similar to each other in all random regions. Furthermore, the F-measure values of S2 were lower than those of both S0 and S1 in all cases. The strategy of S2 has previously been shown to be effective in terms of F-measure metrics in several experiments, where it achieved better performance than the corresponding comparison strategies [38,39]. The situation here may be caused by the characteristics of the tested segmentation algorithm. The SLIC algorithm is a classical superpixel generation method that produces square-like superpixels of a defined size and regular shape. In general, an actual landscape object is usually split by superpixel boundaries into two or more parts if it is far bigger than the superpixel size. Therefore, the edge of an image tile often coincides with the boundaries of many superpixels. The inner superpixels of S0, S1, and S2 at identical locations were approximately the same, so the minimum F-measure of the comparison strategies in the random regions was relatively high. Hence, the border area processing of our strategy suits the characteristics of the SLIC algorithm better than that of S2, which follows repeated calculation rules.
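For reference, the square-like, size-controlled behaviour described above can be reproduced with any standard SLIC implementation; the following minimal example uses scikit-image (not the implementation evaluated in this study) on a synthetic image.

```python
import numpy as np
from skimage.segmentation import slic

# Synthetic 3-band tile standing in for a remote sensing image.
image = np.random.rand(512, 512, 3)

# n_segments sets the approximate superpixel count (and hence their size);
# a higher compactness yields more regular, square-like superpixels.
labels = slic(image, n_segments=400, compactness=20, start_label=1)
print(labels.max())  # roughly 400 superpixels of similar size
```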
In the efficiency evaluation experiments, the 10 repeated execution times usually differed even under identical parameters, and this situation appeared consistently in our experiment. Although this is mainly caused by the hardware conditions of the distributed cluster and the task scheduling circumstances of Apache Spark, the general trend for each case could be discerned by recording the repeated runs. Moreover, the scatter points in Figure 7 and Figure 8 also reflect the actual trend of the experiment. In terms of execution time, both S1 and S2 took far longer than S0. The strategy of S2 needs repeated calculations to modify the incomplete superpixels, which consumes extra time. The strategy of S1 removes the repeated calculations of S2 and modifies the incomplete superpixels of image tiles by enlarging the superpixels with shared cluster centers. However, the even and odd tiles of S1 must be processed in two separate phases, and the odd tiles cannot be processed until the even tiles have been processed completely. In contrast, the strategy of S0 simply segments the image tiles without any border area operations, so it is significantly faster than the other two strategies.
The execution time with eight executors was shorter than that with four executors, and the difference between S1 and S2 was larger in the latter case under identical parameters. On the one hand, the number of executors can weaken the advantage of S1 over S2: the total map workload is divided among the executors, so the absolute time difference between the two strategies shrinks as the number of executors grows. On the other hand, the adopted large-scale images are not large enough; the larger the image, the greater the advantage the strategy would present. In addition, the partition method of Apache Spark can also influence the execution time, as the default partition method was used in all experiments. It may cause waiting among the calculation nodes, which explains why S1 brought only a modest advantage in most cases.
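For illustration, the executor count can be fixed through the Spark configuration at submission time; the resource values below are placeholders, not the cluster settings used in the experiments.

```python
from pyspark import SparkConf, SparkContext

# Placeholder resources: rerun with spark.executor.instances = "4" to
# reproduce the four-executor comparison discussed above.
conf = (SparkConf()
        .setAppName("distributed-slic")
        .set("spark.executor.instances", "8")
        .set("spark.executor.cores", "2")
        .set("spark.executor.memory", "4g"))
sc = SparkContext(conf=conf)
```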
As can be seen visually in Figure 13, the superpixels generated by S1 in the black circle of Figure 13n differ from those in Figure 13h. This is mainly caused by the characteristics of our strategy: the even tiles are segmented first, and their superpixels remain fixed until the end. In addition, our research did not evaluate the segmentation results with unsupervised models, because all segmentation operations strictly follow the theory of the SLIC algorithm throughout the calculations. Hence, the superpixels generated by each strategy are credible in terms of unsupervised metrics.
6. Conclusions
In the satellite earth observation field, the data volume is increasing exponentially, which brings new challenges for conventional image processing. To this end, employing sequential algorithms to process large-scale images over a distributed cluster is becoming a new research field. However, most sequential algorithms transplanted directly to a distributed environment cause incomplete object issues in the border areas of image tiles. To implement sequential algorithms over a distributed cluster conveniently, the SLIC algorithm was selected as the study example for addressing the incomplete object issue, and a new distributed image segmentation strategy based on the big data platform Apache Spark was proposed. A large-scale image was loaded and decomposed into multiple image tiles. According to their indexes, all image tiles were divided into even and odd tiles, which participated in two independent calculation processes. First, the even tiles were split into superpixels by the SLIC algorithm. Second, the locations of the cluster centers and the buffer sizes of each side of the even tiles were recorded, extracted, and managed by shared variables, and organized as prior information. Then, the odd tiles were shuffled to acquire the pixels of the even tiles based on the buffer sizes. Meanwhile, the buffered odd tiles were segmented by the SLIC algorithm using the same parameters as the even tile segmentation, with the help of the prior information. Finally, the even and odd tiles were ingested from the calculation nodes to the master node to acquire the final result. According to the performance evaluation, the proposed strategy achieved higher F-measure values and shorter run times than the comparison strategies. In addition, the increase in its efficiency advantage as the number of executors decreases means the proposed strategy runs relatively faster with limited calculation resources. The new strategy modifies the incomplete superpixels well by enlarging the superpixels with shared cluster centers, and it optimizes communications by carrying out shuffle operations among only approximately half of the image tiles. These changes not only improved the accuracy but also increased the execution efficiency and reduced the volume of data transmitted over Apache Spark. Hence, the proposed strategy is well suited to superpixel algorithms, as its superpixels are more similar to those in the reference images.
In future work, the new distributed strategy should be tested on larger-scale images. Optimization efforts should also focus on the partition method of Apache Spark to remove cases of data skew and improve execution efficiency. In addition, further applications should address practical demands rather than be limited to image segmentation.