2.2.1. Fisheries’ Data Processing
- (1) Calculate CPUE
Catch per unit effort (CPUE) is a measure commonly used to reflect the abundance of a fishery resource. In this experiment, CPUE is calculated from the number of hooks and the mass of the catch in kilograms within each 5° × 5° grid cell (in kg/thousand hooks). The formula is as follows:
$$\mathrm{CPUE}_{ij} = \frac{C_{ij}}{E_{ij}}$$
where $\mathrm{CPUE}_{ij}$, $C_{ij}$, and $E_{ij}$ denote the CPUE, the monthly total catch mass (kg), and the monthly total number of hooks released (in thousands of hooks) for the grid cell at longitude i and latitude j, respectively.
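The CPUE calculation described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the record layout `(year, month, lon, lat, catch_kg, hooks)` and the function name `monthly_cpue` are assumptions made for the example.

```python
from collections import defaultdict


def monthly_cpue(records, cell_size=5.0):
    """Aggregate catch and effort per month and grid cell, then return CPUE.

    records: iterable of (year, month, lon, lat, catch_kg, hooks) tuples
             (a hypothetical layout chosen for this sketch).
    Returns a dict keyed by (year, month, lon_index, lat_index) with CPUE
    in kg per thousand hooks.
    """
    catch = defaultdict(float)
    hooks = defaultdict(float)
    for year, month, lon, lat, catch_kg, n_hooks in records:
        # Floor division gives the index of the cell that contains the point.
        key = (year, month, int(lon // cell_size), int(lat // cell_size))
        catch[key] += catch_kg
        hooks[key] += n_hooks
    # Convert hooks to thousands; cells with no recorded effort are skipped.
    return {k: catch[k] / (hooks[k] / 1000.0) for k in catch if hooks[k] > 0}
```

For instance, two records in the same cell and month with 500 kg and 300 kg of catch and 2000 hooks each yield 800 kg over 4 thousand hooks, i.e., a CPUE of 200 kg/thousand hooks.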
- (2) Expansion of monthly data grids
The original fisheries data are organized on a 5° × 5° grid, which is too coarse to capture the distribution of habitats in detail for each month. In addition, the fisheries data for some months are partially missing, which leaves the study area incompletely covered and hinders analysis of the distribution pattern of fish stocks in different months. To address both issues and to align with the spatial resolution of the environmental data, the inverse-distance weighting (IDW) method was employed in this experiment to interpolate and expand the fisheries data month by month over the geographic extent of the studied habitats.
The inverse-distance weighting (IDW) method is one of the commonly used interpolation techniques in geostatistics for regridding irregular points or gridded data over a spatial extent. The basic reasoning of IDW is that every known measurement point exerts a local influence on the point to be interpolated, and that this influence is inversely related to the distance between them: the closer the points, the greater the influence, and vice versa. If n known measurement points in the spatial extent influence the interpolated point, each is assigned a weight according to its distance from the interpolated point, with the weight decreasing as the distance increases [20]. The interpolated value F is calculated as follows:
$$F = \sum_{i=1}^{n} w_i z_i$$
where F denotes the result obtained by interpolation with the inverse-distance weighting method, n denotes the number of measurement points that influence the interpolated point, $z_i$ denotes the value of the i-th measurement point, and $w_i$ denotes the weight of the i-th measurement point. The formula for $w_i$ is as follows:
$$w_i = \frac{h_i^{-p}}{\sum_{j=1}^{n} h_j^{-p}}$$
where p is the weight coefficient, which can be any positive integer (the default value of 2 is chosen for this experiment), and $h_i$ denotes the Euclidean distance between the i-th measurement point and the interpolated point.
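As an illustration of the interpolation just described, the following Python sketch computes an IDW estimate at a single point and applies it over a set of grid nodes. It is a minimal sketch under stated assumptions: distances are plain Euclidean distances in degrees, all known points contribute to every estimate, and the names `idw` and `regrid` are hypothetical.

```python
import math


def idw(known, target, p=2):
    """Inverse-distance-weighted estimate at `target`.

    known:  list of ((x, y), value) measurement points.
    target: (x, y) location of the point being interpolated.
    p:      weight coefficient (2 in the experiment described above).
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for (x, y), value in known:
        h = math.hypot(x - target[0], y - target[1])  # Euclidean distance
        if h == 0.0:
            # The target coincides with a measurement point: use its value.
            return value
        w = h ** (-p)
        weighted_sum += w * value
        weight_total += w
    # Dividing by the weight total normalizes the weights so they sum to 1.
    return weighted_sum / weight_total


def regrid(known, lons, lats, p=2):
    """Interpolate onto every (lon, lat) node of a target grid."""
    return {(lon, lat): idw(known, (lon, lat), p) for lon in lons for lat in lats}
```

With two known points of value 0 and 9 located 1 and 2 units from the target, the unnormalized weights are 1 and 0.25, so the estimate is 2.25 / 1.25 = 1.8, which shows how the nearer point dominates.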
In this experiment, the fishery data were divided by month to better study the monthly changes in habitat, and the data for each month were regridded using the inverse-distance weighting method. The single-month habitat data for Bigeye Tuna were generated by dividing the area into a 1° × 1° grid. Weights were assigned based on the distances from each grid point to the known points, and the value of every grid point to be filled was calculated from them. For example, the grids for January 2006 before and after regridding are shown in Figure 1 and Figure 2.
In this experiment, the gridded fisheries data were expanded to 329 entries per month by the inverse-distance weighting method, totaling 32,571 entries for January and May to December 2003–2013. The spatial resolution of the fisheries data was improved from 5° × 5° to 1° × 1°, providing a more detailed portrayal of the monthly habitat distribution of Bigeye Tuna. The increase in the number of samples reduced the likelihood of random errors in model training, while the gap in spatial resolution with product-level environmental data was narrowed, facilitating subsequent data-matching.
2.2.2. Remote-Sensing Data Processing
The two remote-sensing datasets used in this experiment need to be preprocessed separately because they differ significantly in data structure and spatiotemporal resolution.
- (1) The processing of missing Env data
The environmental data, i.e., the product-level data of marine environmental factors (Env data), consist of five kinds: 0~300 m Sea Surface Temperature (SST), Chlorophyll Concentration (Chlα), Sea Level Anomaly (SLA), 0~400 m Sea Surface Salinity (SSS), and 200~300 m Dissolved Oxygen Concentration (O). Because different vertical depth ranges were selected for the five marine Env factors, they were split by depth into 11 single Env factors at fixed depth points in this experiment [21], i.e., sst0, sst100, sst200, sst300, Chlα, sla, sss0, sss200, sss400, o200, and o300.
The initially acquired data files for the five marine environmental factors include four attribute dimensions, i.e., time, depth, latitude, and longitude. The eleven single-depth marine environmental factors are generated by reading the Env data at the specific fixed depth points. Because the original Env data contain a small number of consecutive missing values in the selected geospatial area, the Random Forest algorithm is used in this experiment to fill them. Rows and columns without missing values are selected as training data, relying on the correlation within the continuous data. After the Random Forest model is generated, the rows and columns with the fewest missing values are filled first. The dataset is updated after each filling, and this process is repeated until all missing values are filled.
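One way to read the filling procedure above is as an iterative, column-wise imputation. The sketch below uses NumPy and scikit-learn's `RandomForestRegressor`. It is an assumption-laden illustration rather than the authors' implementation: the text does not specify how rows and columns are chosen in detail, so this version uses only the currently complete columns as predictors, fills the column with the fewest gaps first, and requires at least one complete column to start from. The function name `rf_impute` is hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor


def rf_impute(data, n_estimators=50, random_state=0):
    """Fill NaNs column by column, fewest missing values first (a sketch)."""
    filled = np.array(data, dtype=float)
    while np.isnan(filled).any():
        missing_per_col = np.isnan(filled).sum(axis=0)
        incomplete = np.where(missing_per_col > 0)[0]
        # Pick the incomplete column with the smallest number of gaps.
        target = incomplete[np.argmin(missing_per_col[incomplete])]
        # Assumption: predictors are the columns that are complete right now.
        predictors = np.where(missing_per_col == 0)[0]
        rows_missing = np.isnan(filled[:, target])
        if predictors.size == 0 or rows_missing.all():
            raise ValueError("need one complete column and one observed value")
        model = RandomForestRegressor(
            n_estimators=n_estimators, random_state=random_state
        )
        # Train on rows where the target column is observed.
        model.fit(filled[~rows_missing][:, predictors],
                  filled[~rows_missing, target])
        # Fill the gaps; the updated column can serve as a predictor next round.
        filled[rows_missing, target] = model.predict(
            filled[rows_missing][:, predictors]
        )
    return filled
```

Each pass completes one column, so the loop ends after at most as many passes as there are incomplete columns, and observed values are never overwritten.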
- (2) The preprocessing of L1B data
The original multispectral remote-sensing data used in this experiment are MODIS L1B level data containing 36 bands (L1B data). Bands 8~16 carry remote-sensing information on ocean color and phytoplankton and are often used to retrieve the ocean products of MODIS data, while bands 31~32 relate to the Earth’s surface, cloud-top temperatures, and the atmosphere and can be used to identify invalid values caused by cloud occlusion. Therefore, these 11 bands are selected as experimental data to study the relationship between marine remote-sensing data and the distribution of habitats.
Before use, the L1B data need to be processed to identify regions occluded by cloud and to remove invalid, non-marine information; the method selected for this experiment is the Normalized Detection Index method [22]. The reflectance of the ocean surface differs from that of cloud features in the visible (0.66 μm) and thermal infrared (11 μm) bands. Because the ocean surface has a lower reflectance at 0.66 μm and a higher reflectance at 11 μm, the normalized difference between the two bands tends to be larger for ocean points than for cloud points. The probability of a point being ocean is calculated using the Normalized Detection Index method:
$$P = f\left(R_{11} - R_{0.66}\right)$$
where P denotes the probability that a point is ocean, $R_{0.66}$ and $R_{11}$ denote the reflectivity of the point in the 0.66 μm band and the 11 μm band, and f(x) is a normalization function that maps x to the interval [–1, 1]. A point is considered to be ocean when P ≥ 80%.
Because the temporal resolution of the L1B data is daily, several observations still exist at the same pixel each month after the invalid values have been identified and filtered out. To harmonize with the temporal resolution of the fishery data, the remaining L1B data for each month were divided in chronological order into groups of equal numbers of days, and a mean value was obtained for each group, ultimately retaining 5 data records per month to match the fishery data. This treatment greatly reduces the error introduced by the missing-value treatment while preserving the continuous-time features of the L1B data.
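The monthly grouping step can be sketched as follows for a single pixel and band. This is a minimal sketch with assumptions: invalid (cloud-masked) days are represented by `None`, and months whose day count is not divisible by five are split into consecutive chunks whose sizes differ by at most one day, a detail the text does not specify. The function name `five_period_means` is hypothetical.

```python
def five_period_means(daily_values, n_groups=5):
    """Average chronologically ordered daily values into n_groups means.

    daily_values: values for one pixel in one month, ordered by day;
                  None marks a day filtered out as invalid.
    Returns a list of n_groups means (None if a group has no valid day).
    """
    n = len(daily_values)
    means = []
    for g in range(n_groups):
        # Integer bounds give consecutive chunks of near-equal length.
        start = g * n // n_groups
        stop = (g + 1) * n // n_groups
        valid = [v for v in daily_values[start:stop] if v is not None]
        means.append(sum(valid) / len(valid) if valid else None)
    return means
```

For a 30-day month each group covers six days, so a pixel that is cloud-masked on a few days still contributes a mean as long as one valid day remains in the group.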
Both the Env and fishery data have a monthly temporal resolution, so no further temporal processing is needed. The Env dataset has a spatial resolution of 0.25° × 0.25°, while the fisheries data have a spatial resolution of 1° × 1° after grid expansion. To make the two datasets compatible, the matrix matching method was used: the Env data were organized into 10 × 10 matrices centered on the fisheries data points [23,24], each encompassing a 2.5° × 2.5° range of marine environmental conditions around its fisheries data center point. The dataset generated after matching is referred to as Env_Fishery, with its data structure illustrated in Figure 3.
The method preserves the original resolution of Env data, effectively reducing calculation errors relative to the traditional mean processing matching method. The spatial characteristics of various environmental factors can be accurately reflected by the Env_Fishery dataset, which is the smallest unit after matching the environment and fishery data, providing rich and comprehensive information on environment-fishery linkages.
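The matrix matching step can be illustrated with a short Python sketch that cuts a 10 × 10 window of 0.25° Env cells around a fishery point. The grid conventions here are assumptions made for the example: `env` is a 2-D list whose row 0 and column 0 start at the cell edges `lat0` and `lon0`, latitude increases with the row index, and windows that would extend beyond the grid are skipped. The function name `env_window` is hypothetical.

```python
def env_window(env, lat0, lon0, lat_c, lon_c, res=0.25, size=10):
    """Return the size x size block of Env cells centered on a fishery point.

    env:          2-D list indexed [row][col] on a regular grid.
    lat0, lon0:   latitude/longitude of the lower edge of row 0 / column 0.
    lat_c, lon_c: fishery data point used as the window center.
    """
    half = size // 2
    # Index of the cell edge that coincides with the fishery point.
    r_c = round((lat_c - lat0) / res)
    c_c = round((lon_c - lon0) / res)
    r0, c0 = r_c - half, c_c - half
    # Assumption: points too close to the grid boundary yield no window.
    if r0 < 0 or c0 < 0 or r0 + size > len(env) or c0 + size > len(env[0]):
        return None
    return [row[c0:c0 + size] for row in env[r0:r0 + size]]
```

With five cells on each side of the point, the window spans 10 × 0.25° = 2.5° in both directions, so every Env value inside it is kept at its native resolution rather than averaged.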
The Env_Fishery dataset was further organized to explore temporal patterns in the distribution of habitats and the correlation between the temporal characteristics of marine environmental factors and the formation of fisheries.
Studies have shown that the effects of marine environmental factors on the distribution of habitats exhibit a lag, with the impacts of water temperature and phytoplankton content on CPUE values in the current month persisting for several months [25]. In addition, the distribution of fishing habitats at monthly granularity demonstrates temporal continuity, showing similarities between the same month in different years. Therefore, if the object of study is the distribution of fishing grounds in month m of year y:
First, a time series of multiple months within the same year is constructed with a step size of 3 months. The Env_Fishery data of months m−1, m−2, and m−3 of year y are aligned to correspond with the CPUE value of month m of year y, allowing for the exploration of the time-series characteristics of consecutive months within a fixed year. Second, a time series of consecutive years with the same month is constructed, setting the year step to 3 years. The Env_Fishery data of month m in years y−1, y−2, and y−3 are arranged to correspond to the CPUE values of month m in year y, enabling the exploration of the temporal characteristics of consecutive years with fixed months. Finally, the temporal-environmental-fisheries dataset Env_Fishery_Time is generated, and its data structure is shown in Figure 4.
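The construction of the two time series can be sketched as a lookup of lagged (year, month) keys. The sketch below is illustrative only, and the names `lagged_keys` and `valid_targets` are hypothetical. It also makes one assumption that goes beyond the text: for January, the three preceding months are taken from the end of the previous year, since the text does not say how the year boundary is handled.

```python
def lagged_keys(year, month, month_steps=3, year_steps=3):
    """Return the (year, month) keys feeding the target month, oldest first.

    First list:  the month_steps months preceding the target month.
    Second list: the same month in the year_steps preceding years.
    """
    months = []
    for k in range(month_steps, 0, -1):
        # Count months on a single axis so the lag can cross a year boundary
        # (an assumption of this sketch).
        idx = year * 12 + (month - 1) - k
        months.append((idx // 12, idx % 12 + 1))
    years = [(year - k, month) for k in range(year_steps, 0, -1)]
    return months, years


def valid_targets(available):
    """Targets in `available` whose lagged keys are all available too."""
    targets = []
    for year, month in sorted(available):
        months, years = lagged_keys(year, month)
        if all(key in available for key in months + years):
            targets.append((year, month))
    return targets
```

Under that assumption, feeding in fishery data for January and May to December of 2003–2013 leaves January and August to December of 2006–2013 as the targets with complete series in both directions.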
The Env_Fishery_Time dataset continues to be organized with the Env_Fishery dataset as the base unit and contains feature information in the three dimensions of environment, space, and time. The generated Env_Fishery_Time dataset focuses on January and August to December from 2006 to 2013 as the primary periods for studying fishery habitats due to the requirements of time-series steps in the dataset, and the remaining months serve as supplementary material to explore the relationship between various months and fishery habitats.
The five data records retained each month after L1B data preprocessing are time-continuous and can reflect the dynamics of the MODIS data for that month. Therefore, the step size of the time series was set to 5 to match the fisheries data for the current month and harmonize the temporal resolution of the L1B data and the fisheries data. Spatially, because of the large volume of MODIS data, the matrix matching method may lead to computational overload of the model. To address this, the averaging method was chosen to adapt to the spatial resolution of the fisheries data and to handle missing or invalid values at some L1B data points. Following the matching process, the time-series multispectral dataset L1B_Fishery_Time is generated, spanning January and August–December of 2006–2013. Its data structure is shown in Figure 5.
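The spatial averaging used for the L1B data can be sketched as a per-cell mean that skips invalid pixels. This is a minimal sketch, not the authors' pipeline: the pixel layout `(lon, lat, value)`, the use of `None` or NaN for invalid pixels, and the function name `cell_means` are assumptions for illustration.

```python
import math
from collections import defaultdict


def cell_means(pixels, cell=1.0):
    """Average valid pixel values inside each cell of a regular grid.

    pixels: iterable of (lon, lat, value); None or NaN marks an invalid pixel.
    Returns a dict keyed by (lon_index, lat_index) of the containing cell.
    """
    total = defaultdict(float)
    count = defaultdict(int)
    for lon, lat, value in pixels:
        if value is None or math.isnan(value):
            continue  # invalid pixels do not contribute to the mean
        key = (math.floor(lon / cell), math.floor(lat / cell))
        total[key] += value
        count[key] += 1
    return {k: total[k] / count[k] for k in total}
```

Because invalid pixels are simply left out of the mean, a 1° cell still receives a value as long as at least one valid pixel falls inside it.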