1. Introduction
The most sophisticated methods for performing various challenging tasks, such as image segmentation [
1] and classification [
2] in computer vision and image analysis, are convolutional neural networks (CNNs). Each convolutional layer of a CNN comprises convolutions, nonlinear activations, and often pooling operators, and the convolutional layers are typically followed by one or more fully connected layers. CNNs are feedforward networks because information is passed in only one direction, from input to output. CNNs and artificial neural networks (ANNs) are rooted in biological principles. Their design is inspired by the brain’s visual cortex, which alternates layers of simple and higher-order cells [
3]. There are many different types of CNN architectures, but they all consist of convolutional and pooling layers organized into modules. These modules are followed by one or more fully connected layers, as in standard feedforward neural networks. Modules are typically stacked on top of each other to build complex models [
4]. A typical CNN architecture for a toy-image-classification task is shown in
Figure 1. The images are fed directly into the network and pass through a series of convolution and pooling layers. The representations formed by these operations are then fed into one or more fully connected layers, and a classifier at the end finally produces the predicted class. Although this is the most commonly used basic design in the literature, many improvements to the architecture have recently been proposed to improve image-classification accuracy or reduce computational overhead.
Acting as feature extractors, convolutional layers learn feature representations of their input images. Neurons in convolutional layers are organized into feature maps. Each neuron in a feature map is connected to a neighborhood of neurons in the previous layer (its receptive field) through a set of trainable weights, often called a filter bank. The inputs are convolved with the learned weights to compute a new feature map, and the result is passed through a nonlinear activation function. All neurons within a feature map are constrained to share the same weights, while different feature maps within the same convolutional layer use different weights, so that multiple features can be extracted at each location [
5]. The kth output feature map $Y_k$ can be defined more formally as
$$Y_k = f(W_k \ast x),$$
where $x$ denotes the input image, $W_k$ denotes the convolution filter associated with the $k$th feature map, $\ast$ denotes the 2D convolution operator, and $f(\cdot)$ denotes the nonlinear activation function. The convolution operator computes the inner product of the filter with the input at each location of the input image.
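As an illustration only, the following NumPy/SciPy sketch computes a single output feature map according to this formula; the filter values, the ReLU activation, and the "valid" boundary handling are assumptions made for the example rather than details taken from this study.

```python
import numpy as np
from scipy.signal import correlate2d

def relu(z):
    return np.maximum(z, 0.0)

def feature_map(x, w_k, f=relu):
    """Compute one output feature map Y_k = f(W_k * x).

    Deep learning frameworks implement the "convolution" as a
    cross-correlation, which is what correlate2d computes here;
    'valid' keeps only positions where the filter fits inside x.
    """
    return f(correlate2d(x, w_k, mode="valid"))

# Toy usage: a 5x5 image and a 3x3 filter yield a 3x3 feature map.
x = np.arange(25, dtype=float).reshape(5, 5)
w_k = np.full((3, 3), 1.0 / 9.0)      # an arbitrary averaging filter
print(feature_map(x, w_k).shape)       # (3, 3)
```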
Before the arrival of Convolutional Neural Networks (CNNs), traditional machine learning models such as Support Vector Machines (SVM) [
6] and K-Nearest Neighbors (KNN) [
7] were commonly employed for image classification, where each pixel was treated as an individual feature. The introduction of CNNs revolutionized this approach by using convolutional layers to extract multiple features from an image, enhancing the prediction of output values. Since the convolution operation is computationally intensive, pooling layers were integrated into CNNs to make the process more efficient. Pooling reduces the computational load by down-sampling the input, which decreases the number of computations required while preserving the most critical information. The pooling method streamlines the processing within the network, maintaining essential details with significantly lower resource consumption [
8].
While CNNs have significantly improved image classification, various enhancements have been proposed to further optimize performance, including hybrid models [
9] that combine CNNs with other machine learning algorithms to improve accuracy or reduce computational complexity. Recent advancements, such as adaptive pooling strategies and dynamic pooling methods, adjust pooling operations based on the input, allowing for better flexibility and feature retention [
10]. However, the impact of these newer methods on standard architectures like AlexNet, ResNet, and LeNet remains an area of active research.
This study aims to provide an overview of various pooling methods, discussing the benefits and drawbacks of each approach (see
Table 1). Additionally, we compare their performance in classification tasks using three distinct datasets.
The main contributions of this study include the following:
The proposed study systematically evaluated multiple CNN architectures—LeNet, AlexNet, and ResNet—across various datasets (MNIST, CIFAR-10, and CIFAR-100). This comprehensive analysis sheds light on how these models perform on datasets of differing complexities and sizes, providing insights into their adaptability and generalization capabilities across different image-classification tasks.
By presenting the comparative performance metrics, the proposed study identifies which CNN architectures excel or struggle when applied to specific datasets. This helps researchers to understand the strengths and weaknesses of each model in handling distinct image datasets, aiding in informed model selection for particular tasks or datasets.
This study provides a comprehensive comparison of standard pooling methods—max and average pooling—evaluated across different CNN architectures, including CNN, AlexNet, ResNet, and LeNet, on multiple datasets. While prior studies have discussed individual pooling methods, few have provided a systematic comparison in this context. Additionally, in this study, these methods were evaluated in light of recent advancements, highlighting the practical implications for resource-constrained environments.
The rest of this study is organized as follows:
Section 2 reviews work related to standard pooling methods proposed for computer vision and various image-analysis applications.
Section 3 presents the datasets and experimental procedures, reports and discusses the results in detail, and
Section 4 summarizes the study.
2. Related Work
The publications included in this review were sourced through a thorough search that utilized combinations of the terms “Pooling”, “CNN”, and “Convolution” (along with related terms such as “convolutional”) across titles, keywords, and abstracts. Following an initial screening of the results, additional relevant literature was incorporated by carefully examining references and related works from the selected papers, with a focus specifically on the applications of CNNs. While some foundational studies, such as Yamaguchi’s introduction of max pooling in the early 1990s, are noted, the majority of pooling techniques and advancements have emerged in the last decade [
11].
Figure 2 illustrates a sustained interest in pooling research over the past eight years, with only minor fluctuations in publication frequency.
Two groups of pooling are commonly employed in CNNs for feature reduction. The first is local pooling, which down-samples small local regions of the feature map. The second is global pooling, which derives a single scalar value per feature map, producing a feature vector that represents the whole image [
12]. A fully connected layer takes all these representations and classifies them. In particular, the well-known DenseNet consists of one global pooling layer and four local pooling layers. The three commonly used types of pooling operations are max pooling, average pooling, and min pooling [
13]. This study discusses each pooling operation’s properties, advantages, and limitations.
2.1. Max Pooling
Max pooling is a simple operation widely used in CNNs due to its lack of tuneable parameters [
14]. Max pooling reduces the spatial dimensions of the feature map and provides the network with a degree of invariance. To accomplish this, only the highest value within each k × k neighborhood of the feature map is retained; that is, max pooling selects the largest element in each pooling region. When combined with sparse codes and simple linear classifiers, max pooling shows better performance. For these reasons, it has grown in popularity in recent years [
15]. Max pooling also handles sparse representations efficiently, which is yet another reason for its success. The mathematical expression for max pooling is
$$p_m = \max_{j = 1, \ldots, R} a_{(m-1)N + j},$$
where the $m$th max pooling band $p_m$ is composed from $R$ related convolution filter outputs $a_j$. Here, $N$ is termed the pooling shift, which allows for overlap between adjacent pooling regions when $N < R$. The pooling layer thereby reduces the $K$ convolution bands of its input to a smaller number of pooled bands in the resulting layer.
The primary limitation of max pooling lies in its selection of the maximum element from the pooling region while disregarding other values, potentially leading to the loss of distinguishing features and critical information. Studies have highlighted that, despite enhancing computational efficiency and reducing dimensionality in CNNs, max pooling can compromise spatial information and introduce inconsistencies in activations [
16]. Furthermore, in object-detection tasks, max pooling often results in poor localization accuracy, particularly for small or low-resolution objects [
17]. To mitigate these drawbacks, a novel approach, called Spatial Pyramid Pooling (SPP), has been proposed, which employs multiple pooling layers at varying spatial resolutions to better capture spatial information, demonstrating superior performance over max pooling in benchmark object detection datasets [
18].
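To make the SPP idea concrete, the sketch below pools a single-channel feature map at several grid resolutions and concatenates the results into one fixed-length vector; the pyramid levels and the use of max pooling per cell are assumptions for illustration, not the exact configuration of the cited work.

```python
import numpy as np

def spatial_pyramid_pool(feature_map, levels=(1, 2, 4)):
    """Pool a 2D feature map at several grid resolutions and concatenate
    the per-cell maxima into one fixed-length vector."""
    h, w = feature_map.shape
    pooled = []
    for n in levels:
        # split the map into an n x n grid of (roughly) equal cells
        rows = np.array_split(np.arange(h), n)
        cols = np.array_split(np.arange(w), n)
        for r in rows:
            for c in cols:
                pooled.append(feature_map[np.ix_(r, c)].max())
    return np.array(pooled)   # length = sum(n * n for n in levels) = 21 here
```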
Figure 3 illustrates the max pooling operation. In this example, the input to the pooling operation is of size 4 × 4, while the filter size is 2 × 2 with a stride of 2. Max pooling extracts the maximum value of 20 from the first 2 × 2 segment (highlighted in green), and the highest value from each segment is selected to generate the output channel. However, max pooling only considers the largest value and ignores the others. As a result, when most elements have high values, significant features may be lost after max pooling, potentially leading to adverse outcomes.
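For concreteness, the following NumPy sketch reproduces this operation on a hypothetical 4 × 4 input; apart from the value 20 mentioned for Figure 3, the input values are invented for illustration.

```python
import numpy as np

def max_pool2d(x, size=2, stride=2):
    """Max pooling over size x size windows with the given stride."""
    h_out = (x.shape[0] - size) // stride + 1
    w_out = (x.shape[1] - size) // stride + 1
    out = np.empty((h_out, w_out), dtype=x.dtype)
    for i in range(h_out):
        for j in range(w_out):
            window = x[i * stride:i * stride + size,
                       j * stride:j * stride + size]
            out[i, j] = window.max()
    return out

# A 4x4 input pooled with a 2x2 filter and stride 2 gives a 2x2 output;
# each window keeps only its largest element (20 in the top-left window).
x = np.array([[20,  3,  1,  7],
              [ 5,  8,  2,  4],
              [ 6,  9, 12, 11],
              [ 3, 10, 15, 14]])
print(max_pool2d(x))   # [[20  7]
                       #  [10 15]]
```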
2.2. Average Pooling
An average pooling layer performs down-sampling by splitting the input into rectangular pooling regions and computing the average value of each region. The idea of extracting features by averaging them was first introduced by the authors of [
19]. The proposed idea was implemented in the first convolution-based deep neural network.
Figure 4 demonstrates an example of the average pooling operation. The standard average pooling method divides the image input into several independent rectangular boxes; the average value of each box is computed and written to the output channel. Average pooling is mathematically defined as
$$f_{\mathrm{avg}}(x) = \frac{1}{N} \sum_{i=1}^{N} x_i,$$
where $x$ is a vector of the $N$ activations from a rectangular region of an image or a channel (for example, the size of the rectangular region in
Figure 4 is 2 × 2). Average pooling used to be common, but with the arrival of the max pooling technique, its use has become more limited [
20]; its main shortcoming is a loss of information in terms of contrast. All of the activation values in the rectangular box are considered when estimating the mean, so if the strength of all the activations is low, the estimated mean will also be low, resulting in diminished contrast. The situation becomes much worse when the majority of the activations in the pooling region are zero; in that case, the convolutional feature characteristics are reduced significantly. Averaging reduces noise-inducing elements; however, since it gives each element in the pooling region equal priority, background regions may predominate in the pooled representation, which can diminish its discriminative power [
21].
Since neither max pooling nor average pooling consistently demonstrates superior performance, several techniques have emerged that combine the strengths of both methods, such as weighted pooling [
22] and soft pooling [
23]. These hybrid approaches introduce additional parameters, leading to increased learning time and computational overhead. However, these methods still face challenges, as they either prioritize the stronger activation or treat all activations equally, with existing studies primarily depending on activation values to address these limitations.
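As a rough illustration of how such hybrids blend the two operators, the sketch below implements a simple mixed pooling of the form α·max + (1 − α)·mean; the fixed weight α is an assumed hyperparameter here, whereas the cited methods typically learn it or sample it stochastically.

```python
import numpy as np

def mixed_pool2d(x, size=2, stride=2, alpha=0.5):
    """Blend max and average pooling: alpha * max + (1 - alpha) * mean.

    alpha = 1.0 recovers max pooling and alpha = 0.0 recovers average
    pooling; here alpha is fixed purely for illustration.
    """
    h_out = (x.shape[0] - size) // stride + 1
    w_out = (x.shape[1] - size) // stride + 1
    out = np.empty((h_out, w_out), dtype=float)
    for i in range(h_out):
        for j in range(w_out):
            w = x[i * stride:i * stride + size,
                  j * stride:j * stride + size]
            out[i, j] = alpha * w.max() + (1.0 - alpha) * w.mean()
    return out
```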
2.3. Min Pooling
Min pooling is a pooling operation that selects the minimum value within a sliding window, though it is less frequently used than max or average pooling, as it tends to preserve the smallest and least significant features of the input [
24]. However, it is beneficial in specific applications, such as anomaly detection and background subtraction, where detecting differences from a reference signal is essential. A comparison of standard pooling methods—max, average, and min pooling—along with their strengths and weaknesses, is presented in
Table 1. Studies indicate that max pooling is generally more effective for tasks requiring the capture of highly discriminative features, while average pooling offers greater robustness to noise and improved generalization by considering the overall context [
25].
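Since common frameworks such as Keras expose max and average pooling layers but no built-in min pooling, a frequently used workaround is to negate the input, apply max pooling, and negate the result; the small layer below sketches this idea (the class name MinPooling2D and its defaults are our own, not part of the Keras API).

```python
import tensorflow as tf
from tensorflow.keras import layers

class MinPooling2D(layers.Layer):
    """Min pooling implemented as -MaxPooling2D(-x)."""
    def __init__(self, pool_size=2, strides=2, **kwargs):
        super().__init__(**kwargs)
        self.pool = layers.MaxPooling2D(pool_size=pool_size, strides=strides)

    def call(self, inputs):
        return -self.pool(-inputs)

# Usage: a 1x4x4x1 tensor pooled down to 1x2x2x1, keeping each window's minimum.
x = tf.reshape(tf.range(16, dtype=tf.float32), (1, 4, 4, 1))
print(MinPooling2D()(x)[0, :, :, 0].numpy())   # [[ 0.  2.]
                                               #  [ 8. 10.]]
```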
Recent advancements in CNNs have introduced hybrid models and adaptive pooling strategies to improve performance in various tasks. For example, Khairandish et al. (2022) proposed a hybrid approach combining CNNs with Support Vector Machines (SVMs) for more accurate image classification in limited-resource environments, demonstrating that hybrid methods can outperform traditional CNNs in certain contexts [
26]. Similarly, Ding et al. (2024) introduced an adaptive pooling method for image-text retrieval that adjusts pooling parameters based on the input data, allowing the model to retain more relevant features while reducing computational complexity [
27]. Li et al. (2024) explored the benefits of combining CNNs with other algorithms, such as Long Short-Term Memory (LSTM) networks, enabling enhanced feature extraction for recommendation algorithms [
28]. Additionally, Han et al. (2019) surveyed dynamic neural network mechanisms, which vary the pooling size depending on the input characteristics, offering better adaptability and precision for complex datasets [
29]. Zhao et al. (2024) presented a mixed-pooling strategy where max and average pooling were combined in a single layer, demonstrating improvements in accuracy on specific benchmarks. While these studies have significantly contributed to the enhancement of CNNs, particularly through hybridization and adaptive-pooling techniques, they often focus on specific architectures or use cases. In contrast, the proposed study offers a systematic and direct comparison of three widely used pooling methods—max, min, and average pooling—across multiple architectures (AlexNet, ResNet, and LeNet) on diverse datasets. This broader evaluation provides a more generalizable understanding of how different pooling techniques impact performance across standard CNN architectures. Unlike the aforementioned works, which typically propose new architectures or adaptations, this study focuses on refining the core understanding of pooling operations, offering practical insights for improving CNN performance in resource-constrained environments without introducing additional complexity.
3. Material and Methods
To understand the impact of pooling techniques on the performance of convolutional neural networks (CNNs), this study analyzes three standard datasets, each chosen for its unique characteristics and challenges. The methodologies employed are designed to systematically evaluate and compare the effectiveness of different pooling strategies, thereby offering insights into their implications for CNN performance.
3.1. Datasets
This study used three standard datasets to evaluate the performance of pooling techniques across various convolutional neural network (CNN) architectures: MNIST [
30], CIFAR-10 [
31], and CIFAR-100 [
32]. These datasets represent a range of complexities, from simple handwritten digits (MNIST) to diverse object classes (CIFAR-10 and CIFAR-100). The selection of these datasets allows for a comprehensive analysis of how different pooling methods (max, average, and min pooling) impact model performance across varying levels of classification difficulty.
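For reference, all three datasets can be loaded directly through keras.datasets, as sketched below for MNIST; note that these loaders ship fixed train/test partitions, so reproducing the 80/20 split described in Section 3.3 would require re-partitioning the combined data, and the preprocessing shown here is only an assumed minimal setup.

```python
from tensorflow.keras import datasets

# MNIST ships as 60,000 training and 10,000 test grayscale 28x28 images;
# pixels are scaled to [0, 1] and a channel axis is added for the CNNs.
(x_train, y_train), (x_test, y_test) = datasets.mnist.load_data()
x_train = x_train[..., None] / 255.0
x_test  = x_test[..., None] / 255.0

# CIFAR-10 / CIFAR-100 (32x32 RGB images) load the same way:
# datasets.cifar10.load_data() and datasets.cifar100.load_data()
```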
3.2. Model Architecture
The three widely adopted CNN architectures employed in this study are LeNet, AlexNet, and ResNet. These architectures were selected for their established performance and distinct structural characteristics, which provide a robust testing ground for evaluating different pooling strategies. LeNet is a smaller architecture primarily designed for simpler tasks, such as digit classification, and consists of convolutional and fully connected layers. In contrast, AlexNet features a deeper design with more convolutional layers, specifically engineered to address complex image-classification problems. ResNet, recognized for its innovative use of residual connections, is deeper and more intricate than both LeNet and AlexNet, making it particularly well-suited for high-dimensional image-classification tasks. Together, these architectures create a comprehensive platform for analyzing the effects of various pooling techniques on model performance across a range of classification challenges.
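To make the comparison concrete, the sketch below builds a LeNet-5-style Keras model in which the pooling layer type is a swappable argument; the layer widths, activations, and optional dropout placement follow the classic LeNet-5 design and are assumptions rather than the exact configuration reported in Table 2.

```python
from tensorflow.keras import layers, models

def build_lenet(input_shape=(28, 28, 1), num_classes=10,
                pooling=layers.MaxPooling2D, dropout=0.0):
    """LeNet-5-style network; swap `pooling` for AveragePooling2D to compare
    pooling strategies with an otherwise identical model."""
    model = models.Sequential([
        layers.Conv2D(6, 5, activation="tanh", padding="same",
                      input_shape=input_shape),
        pooling(pool_size=2),
        layers.Conv2D(16, 5, activation="tanh"),
        pooling(pool_size=2),
        layers.Flatten(),
        layers.Dense(120, activation="tanh"),
        layers.Dense(84, activation="tanh"),
    ])
    if dropout > 0.0:
        model.add(layers.Dropout(dropout))
    model.add(layers.Dense(num_classes, activation="softmax"))
    return model
```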
3.3. Experimental Setup
The experiments in this study were conducted using Keras with TensorFlow as the backend framework, establishing a controlled environment for evaluating the performance of different pooling techniques across various CNN architectures. Each dataset (MNIST, CIFAR-10, and CIFAR-100) was divided using an 80/20 split, allocating 80% of the data for training and reserving 20% for testing. This division allowed for a thorough assessment of both the recognition capabilities and the generalization performance of the proposed models. To ensure consistency and fairness in the evaluation, model training was executed under multiple configurations of batch sizes (32, 64, 128, and 256), learning rates (0.01, 0.001, and 0.0001), and dropout rates (ranging from 0.1 to 0.5, including a no-dropout condition). The Stochastic Gradient Descent (SGD) optimizer was utilized for the LeNet architecture, while the Adam and RMSProp optimizers were applied to AlexNet and ResNet, respectively, optimizing model performance for multi-class classification tasks. The categorical cross-entropy loss function guided the training process, with early stopping employed based on validation loss to mitigate overfitting. This comprehensive setup ensured a rigorous evaluation of each pooling method’s impact on model accuracy and generalization across diverse datasets. The proposed architecture of comparative approaches is presented in
Table 2,
Table 3,
Table 4 and
Table 5.
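The following sketch outlines how such a hyperparameter grid could be wired up in Keras for the LeNet case (SGD optimizer, categorical cross-entropy, early stopping on validation loss); the epoch budget, early-stopping patience, and validation split used for monitoring are assumptions not specified in the text, and Adam or RMSProp would replace SGD for AlexNet and ResNet, respectively. The build_lenet helper refers to the sketch in Section 3.2, and x_train/y_train to the loading sketch in Section 3.1.

```python
import itertools
from tensorflow.keras import callbacks, optimizers, utils

batch_sizes    = [32, 64, 128, 256]
learning_rates = [0.01, 0.001, 0.0001]
dropout_rates  = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5]   # 0.0 = no dropout

y_train_cat = utils.to_categorical(y_train)        # categorical cross-entropy
                                                   # expects one-hot labels
early_stop = callbacks.EarlyStopping(monitor="val_loss", patience=5,
                                     restore_best_weights=True)

results = {}
for batch_size, lr, dropout in itertools.product(batch_sizes,
                                                 learning_rates,
                                                 dropout_rates):
    model = build_lenet(dropout=dropout)
    model.compile(optimizer=optimizers.SGD(learning_rate=lr),  # SGD for LeNet;
                  loss="categorical_crossentropy",             # Adam / RMSProp
                  metrics=["accuracy"])                        # for AlexNet / ResNet
    history = model.fit(x_train, y_train_cat, validation_split=0.2,
                        batch_size=batch_size, epochs=50,      # epoch budget assumed
                        callbacks=[early_stop], verbose=0)
    results[(batch_size, lr, dropout)] = max(history.history["val_accuracy"])
```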
3.4. Result and Analysis
In our research, each comparative approach involving various neural network architectures, such as CNN, AlexNet, ResNet, and LeNet, was subjected to different configurations of batch size, learning rate, and dropout. Selecting distinct hyperparameters aimed to explore their impact on model performance and convergence across pooling techniques (max and average) on datasets like MNIST, CIFAR-10, and CIFAR-100. Given the sensitivity of these hyperparameters in influencing training dynamics, their variation resulted in divergent accuracy results across the models. Batch sizes were chosen to regulate the number of samples processed per iteration, affecting gradient updates and the convergence speed. Learning rates played a critical role in controlling the step size during optimization, impacting the model’s ability to navigate the loss landscape. Additionally, dropout rates were manipulated to mitigate overfitting by randomly deactivating neurons during training, affecting the model’s generalization capability. Consequently, the discrepancy in accuracy outcomes underscores the nuanced interplay between these hyperparameters and their consequential effect on model learning and generalization across different network architectures and pooling strategies.
3.5. Performance Evaluation in Terms of Accuracy
This section provides a comprehensive comparative analysis of the accuracy of the MNIST, CIFAR-10, and CIFAR-100 datasets across various convolutional neural network (CNN) architectures, including CNN, LeNet, AlexNet, and ResNet. The performance of these architectures is evaluated under different batch sizes, learning rates, and dropout rates, focusing on the effects of max pooling and average pooling techniques. The objective is to identify which pooling technique and architectural configuration yields the best performance for each dataset, thereby offering valuable insights into optimizing CNN models for diverse image-classification tasks. The detailed results of comparative approaches are presented in
Table 6,
Table 7,
Table 8,
Table 9,
Table 10 and
Table 11.
The comparative analysis of CNN, AlexNet, ResNet, and LeNet using max and average pooling methods under varying batch sizes and dropout rates highlights the distinct impact of pooling strategies on model performance, as shown in
Table 6,
Table 7,
Table 8,
Table 9,
Table 10 and
Table 11. Under max pooling conditions, AlexNet consistently outperformed other models, showcasing its ability to extract dominant features from the data efficiently. This superior performance is evident in AlexNet’s stable and high classification accuracy across different configurations, even as dropout rates and batch sizes varied. The architecture of AlexNet is particularly suited for capturing the most salient features, which is effectively facilitated by the max pooling method. In contrast, CNN emerged as the second-best performer in the max pooling setup. Although it demonstrated strong feature-extraction capabilities, CNN showed higher sensitivity to changes in hyperparameters compared to AlexNet, indicating some variability in performance.
When average pooling was employed, the performance dynamics shifted, with CNN taking the lead. CNN’s architecture leveraged the generalization provided by average pooling to smooth feature representations, resulting in more stable and consistent accuracy, particularly at moderate dropout rates and optimized batch sizes. This suggests that average pooling is effective for CNN, as it promotes a balanced feature representation across spatial dimensions and reduces overfitting. AlexNet, while still achieving respectable accuracy, performed suboptimally with average pooling. Its reliance on max pooling for optimal feature extraction limited its ability to fully exploit the benefits of average pooling, which is better suited for models that require feature smoothing rather than emphasizing the most activated features.
Both ResNet and LeNet exhibited relatively lower performance with both pooling methods. ResNet’s complex architecture with skip connections did not gain significant benefits from either pooling strategy, while LeNet’s simpler structure was unable to fully capitalize on the feature-extraction capabilities of max or average pooling. Overall, this study underscores the importance of selecting pooling strategies that align with the architectural strengths of deep learning models. The findings demonstrate that AlexNet excels with max pooling, while CNN performs best with average pooling, highlighting the necessity of adaptive pooling approaches to optimize model accuracy based on the specific dataset and model architecture.
3.6. Statistical Significance Analysis
To further validate the robustness and reliability of the comparative results between max and average pooling methods across standard datasets (CIFAR-10, CIFAR-100, and MNIST) using CNN, AlexNet, ResNet, and LeNet, a statistical significance analysis was conducted. Specifically,
p-values were calculated to assess whether the differences in performance metrics, such as accuracy, precision, and computational efficiency, were statistically significant. The
p-values provide an objective measure to determine whether the observed differences between pooling methods are due to random variation or represent true distinctions in performance [
33]. A significance threshold of 0.05 was adopted, where
p-values below this level indicate that the performance differences are statistically significant and unlikely to have occurred by chance.
Table 12 shows the comparative analysis of
p-values on standard datasets.
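The text does not state which statistical test was used; purely as an illustration, the sketch below computes a p-value with a paired t-test (scipy.stats.ttest_rel) over hypothetical per-run accuracies of the two pooling methods and compares it against the 0.05 threshold adopted in this study.

```python
from scipy import stats

# Accuracies from repeated runs (e.g., different seeds) of the same
# configuration under max and average pooling -- placeholder values only.
acc_max = [0.988, 0.987, 0.989, 0.986, 0.988]
acc_avg = [0.985, 0.984, 0.986, 0.983, 0.985]

# Paired t-test: the runs are matched by configuration/seed.
t_stat, p_value = stats.ttest_rel(acc_max, acc_avg)
significant = p_value < 0.05    # significance threshold used in the study
print(f"t = {t_stat:.3f}, p = {p_value:.4f}, significant = {significant}")
```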
The
p-value results highlight the comparative performance of max pooling and average pooling across various models and datasets, demonstrating that max pooling consistently yields more statistically significant results in many cases. For instance, in the MNIST dataset, max pooling shows superior performance, especially with LeNet, where its
p-values are significantly lower across all learning rates, indicating stronger statistical significance compared to average pooling. This trend is also evident in the CIFAR-10 dataset, where max pooling exhibits better robustness, particularly at a learning rate of 0.01, with a
p-value of 0.0042 compared to 0.0001 for average pooling. Similarly, in the CIFAR-100 dataset, max pooling performs better at lower learning rates, maintaining strong statistical significance with more consistent
p-values across architectures like CNN and LeNet. While average pooling occasionally shows lower
p-values, particularly in deeper models like ResNet, max pooling demonstrates greater consistency and reliability across the board. Overall, max pooling emerges as the more robust and reliable pooling method, offering better statistical significance and performance stability across a variety of datasets and architectures.
Figure 5 illustrates the comparative analysis of
p-values for max pooling and average pooling across the different neural network architectures and datasets.
3.7. Convergence Graph
A convergence graph visualizes the learning progress of a model over time, illustrating the effectiveness of various training parameters and architectures. This analysis focuses on the convergence graphs of four neural network architectures, CNN, AlexNet, LeNet, and ResNet, generated using Matplotlib, as shown in
Figure 6. Key observations from the comparative analysis reveal that CNN and AlexNet exhibit stable convergence with minimal oscillations, indicating reliable training processes. In contrast, AlexNet converges quickly due to its deeper architecture, while ResNet shows initial instability but ultimately achieves strong performance, reflecting its robustness. AlexNet and ResNet demonstrate rapid initial improvements due to their complex designs, effectively capturing intricate patterns, whereas CNN and LeNet show more gradual progress. The plateau in AlexNet and stabilization in ResNet suggest these models reach optimal performance quickly, while CNN and LeNet may require more epochs for full optimization. Overall, the graphs highlight the distinct strengths of each architecture: CNN and AlexNet provide stable learning for simpler tasks, AlexNet excels in fast early stage learning for complex data, and CNN, despite initial fluctuations, demonstrates high performance and robustness. Adjusting hyperparameters like learning rate and batch size can further enhance these architectures for use in specific datasets and tasks.
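A minimal Matplotlib sketch of how such convergence curves can be produced from Keras History objects is shown below; the histories dictionary and its contents are assumed to come from the training runs described in Section 3.3.

```python
import matplotlib.pyplot as plt

def plot_convergence(histories):
    """Plot per-epoch validation accuracy for each named model.

    `histories` maps a model name to the History object returned by
    model.fit (see the training sketch in Section 3.3).
    """
    for name, history in histories.items():
        plt.plot(history.history["val_accuracy"], label=name)
    plt.xlabel("Epoch")
    plt.ylabel("Validation accuracy")
    plt.title("Convergence of the compared architectures")
    plt.legend()
    plt.show()

# plot_convergence({"LeNet": lenet_history, "AlexNet": alexnet_history,
#                   "ResNet": resnet_history, "CNN": cnn_history})
```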
3.8. Analysis of Optimal Pooling Performance and Parameter Trends
This section identifies and discusses the optimal performance achieved by max and average pooling across the MNIST, CIFAR-10, and CIFAR-100 datasets while also addressing the observed trends in performance based on parameter variations.
The CNN performs best with average pooling, achieving higher accuracy on MNIST and leveraging this pooling method to generalize effectively across the dataset. Specifically, CNN peaks with average pooling at 98.82% accuracy on MNIST using a learning rate of 0.001 and a batch size of 32 with a low dropout rate, suggesting that CNN benefits from the smoothed feature extraction of average pooling. In contrast, ResNet exhibits superior performance with max pooling, particularly on CIFAR-10 and CIFAR-100, where complex features demand higher selectivity. For instance, ResNet’s optimal CIFAR-10 performance with max pooling reaches 54.68% at a learning rate of 0.0001 and dropout of 0.4–0.5, showcasing its alignment with max pooling’s concentrated feature selection.
On the MNIST dataset, each model displays varied responses to pooling methods. LeNet performs moderately, achieving its peak of 98.9% accuracy with average pooling, a learning rate of 0.001, and a batch size of 128, though it generally trails CNN and AlexNet. AlexNet demonstrates robustness with both pooling methods, attaining high performance with max pooling, especially on CIFAR-10, where it reaches 67.89% accuracy with a learning rate of 0.001 and moderate dropout. This suggests that AlexNet’s deeper architecture handles CIFAR-10’s feature complexity well under max pooling. ResNet, which excels at retaining intricate feature details, benefits significantly from max pooling across datasets, particularly CIFAR-100, achieving around 17.13% accuracy. Although this result is lower than with simpler datasets, it reflects ResNet’s high selectivity in feature extraction.
On CIFAR-100, which is more challenging due to its 100-class complexity, CNN’s accuracy peaks at 22.26% with max pooling, again showing a preference for structured feature selection. LeNet, on the other hand, struggles to maintain high accuracy across both pooling methods, achieving only around 13.5% accuracy on CIFAR-100, pointing to limitations in its simpler architecture for such complex data. AlexNet continues to perform relatively well on CIFAR-100, reaching its highest accuracy at around 30.89% with max pooling, while ResNet achieves its best result with max pooling, reaching around 17.13%. The analysis confirms that CNN benefits most from average pooling on MNIST, while ResNet performs best on max pooling across CIFAR datasets. This pattern underlines the importance of aligning pooling techniques with model architecture and dataset complexity for optimal performance.
The observed parameter trends in
Table 6,
Table 7,
Table 8,
Table 9,
Table 10 and
Table 11 reveal the importance of the batch size, learning rate, and dropout rate adjustments, which significantly impact pooling performance across different architectures and datasets. Lower learning rates and moderate batch sizes generally enhanced model stability and accuracy, especially for complex datasets like CIFAR-100. Higher dropout rates tended to improve generalization in average pooling, which smooths feature representations, though they occasionally hindered accuracy in configurations where feature retention was critical, as in max pooling. These trends underscore that max pooling often benefits from moderate batch sizes and minimal dropout to retain strong activations, whereas average pooling performs optimally with larger batch sizes and higher dropout rates, especially for noisier datasets. It is important to note that while these trends offer insight, generalizing them across all applications is challenging, as parameter effects can vary significantly depending on dataset characteristics and task requirements. Adaptive parameter tuning based on dataset-specific needs could provide more reliable results in future studies. Overall, these findings suggest that selecting pooling methods based on dataset complexity and noise levels is crucial, particularly when applying CNNs to resource-constrained environments, where optimized configurations can greatly enhance performance and generalization.
3.9. Discussion
This research provided an in-depth analysis of various pooling methods in Convolutional Neural Networks (CNNs), specifically max, average, and min pooling, across datasets such as MNIST, CIFAR-10, and CIFAR-100. The results demonstrate that each method has unique strengths and limitations that make it suitable for different tasks. Max pooling consistently performed well in scenarios where preserving high-contrast features and robustness to noise were critical. The ability of max pooling to capture the most prominent features from the feature maps allowed it to excel in classification tasks with high-dimensional input, such as the CIFAR-100 dataset. However, the technique’s downside lies in its tendency to discard potentially useful information, which may explain its reduced performance when applied to datasets with smaller objects or more intricate details, where preserving all information is important. This is particularly relevant in applications such as small-object detection, where discarding finer details may lead to poor localization accuracy, as noted in several studies on object detection.
In contrast, average pooling showed a more balanced feature representation and performed well on complex datasets, such as CIFAR-10. By smoothing out noise and reducing the impact of outlier values in the feature map, average pooling can generalize better in tasks where the input is noisy or where capturing the overall context is more important than highlighting specific high-contrast features. For instance, in semantic segmentation tasks, where each pixel’s classification matters more than a focus on high-intensity regions, average pooling may outperform max pooling. However, the downside of this method is its inability to preserve sharp edges and fine details, which are crucial in high-precision tasks like medical image analysis or object detection.
Min pooling, though less commonly used, showed its utility in highly specialized tasks such as anomaly detection and background subtraction. By focusing on the least prominent features, min pooling can highlight anomalies or subtle differences in an image, which can be critical in applications like fraud detection or medical imaging, where identifying outliers or rare features is essential. However, the sensitivity of min pooling to noise limits its general applicability in mainstream image-classification tasks, where higher-contrast features dominate.
The results also underscore that no single pooling method universally outperforms the others across all tasks, architectures, and datasets. The effectiveness of a pooling technique is highly dependent on the specific task and dataset characteristics. Max pooling might be preferable for tasks involving object detection or datasets with large, prominent features, whereas average pooling would be a better choice for more balanced, noisy datasets requiring more generalization, such as in semantic segmentation.
Moreover, recent advancements in CNN architectures, such as hybrid pooling methods and adaptive pooling strategies, have sought to combine the strengths of multiple pooling operations. For example, adaptive pooling methods, which dynamically adjust the pooling strategy based on the input, have shown promise in enhancing performance by balancing feature preservation and computational efficiency. These approaches allow CNNs to adapt better to the varying complexities of different datasets, especially for tasks requiring fine-grained classification, like medical imaging and high-precision applications.
In practical terms, the choice of pooling method has important implications for resource-constrained environments. Max pooling offers high accuracy but at the cost of potentially discarding important information, while average pooling provides a more generalizable solution but with a potential loss of detail. Min pooling is effective for specialized tasks but may not be suitable for broader applications. Future research could benefit from exploring adaptive pooling techniques that optimize the trade-offs between computational cost and feature retention, ensuring that the choice of pooling method aligns with specific task requirements.