Article

Urban Sound Recognition in Smart Cities Using an IoT–Fog Computing Framework and Deep Learning Models: A Performance Comparison

Department of Software Engineering, Istanbul Topkapi University, Istanbul 34087, Türkiye
Appl. Sci. 2025, 15(3), 1201; https://doi.org/10.3390/app15031201
Submission received: 11 December 2024 / Revised: 14 January 2025 / Accepted: 19 January 2025 / Published: 24 January 2025

Abstract

Rapid urbanization presents significant challenges in energy consumption, noise control, and environmental sustainability. Smart cities aim to address these issues by leveraging information technologies to enhance operational efficiency and urban liveability. In this context, urban sound recognition supports environmental monitoring and public safety. This study provides a comparative evaluation of three deep learning models—convolutional neural networks (CNNs), long short-term memory (LSTM), and dense neural networks (Dense)—for classifying urban sounds. The analysis used the UrbanSound8K dataset, a static dataset designed for environmental sound classification, with mel-frequency cepstral coefficients (MFCCs) applied to extract core sound features. The models were tested in a fog computing architecture on AWS to simulate a smart city environment, chosen for its potential to reduce latency and optimize bandwidth for future real-time sound-recognition applications. Although real-time data were not used, the simulated setup effectively assessed model performance under conditions relevant to smart city applications. According to macro and weighted F1-score metrics, the CNN model achieved the highest accuracy at 90%, followed by the Dense model at 84% and the LSTM model at 81%, with the LSTM model showing limitations in distinguishing overlapping sound categories. These simulations demonstrated the framework’s capacity to enable efficient urban sound recognition within a fog-enabled architecture, underscoring its potential for real-time environmental monitoring and public safety applications.

1. Introduction

As the global population continues to grow, it is estimated that by 2050, over six billion people—70% of the world’s population—will reside in urban areas [1]. This rapid urbanization has brought significant challenges, including increased energy consumption, noise pollution, and environmental degradation. Smart cities aim to address these issues by leveraging advanced information and communication technologies (ICT) to optimize municipal services, enhance urban life quality, and improve governance efficiency [2,3,4]. Urban sound recognition systems play a critical role in smart cities by analyzing and classifying diverse acoustic environments such as traffic noise, sirens, and human activities. These systems provide essential tools for monitoring noise pollution, ensuring public safety, and fostering sustainable urban development [5].
The Internet of Things (IoT) underpins smart city applications by enabling seamless data transmission between interconnected sensor devices. This ecosystem facilitates essential tasks, including environmental monitoring and urban sound recognition, by providing a foundation of effective sensing, extensive connectivity, and intelligent data analysis [6,7,8,9]. However, the increasing volume of data generated by IoT devices presents challenges related to latency, bandwidth utilization, and scalability when managed through traditional centralized cloud systems. These limitations necessitate more efficient architectures capable of handling real-time data processing.
Fog computing has emerged as a solution to these challenges by decentralizing data processing to nodes closer to IoT devices. Unlike cloud computing, which relies on centralized data centers, fog computing reduces latency, optimizes bandwidth usage, and enhances data privacy. This decentralized paradigm is particularly effective for applications such as urban sound recognition, where low-latency responses are critical for real-time noise monitoring and public safety interventions [10,11]. By bridging the gap between edge and cloud computing, fog computing combines the computational efficiency of localized processing with the scalability of cloud infrastructures. It also provides intermediate nodes with enhanced processing capabilities, enabling the execution of complex operations such as deep-learning-based sound classification [12,13].
Traditional urban-sound-recognition approaches have relied heavily on feature engineering methods, such as MFCCs, Chroma features, Spectral Contrast, and Zero-Crossing Rate (ZCR). These features were typically processed using machine learning algorithms like Support Vector Machines (SVMs), Random Forests, and k-Nearest Neighbors (k-NN), or statistical models including Gaussian Mixture Models (GMMs) and Hidden Markov Models (HMMs) [14,15,16]. While these methods demonstrated effectiveness in specific scenarios, they struggled to adapt to noisy, real-world soundscapes due to their reliance on handcrafted features. This limitation has driven the adoption of more advanced methods capable of learning directly from raw data.
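For illustration, a minimal version of such a handcrafted pipeline is sketched below, assuming librosa for feature extraction and scikit-learn for the classifier; none of the parameter values are taken from this study.

```python
import numpy as np
import librosa
from sklearn.svm import SVC

def handcrafted_features(path: str) -> np.ndarray:
    """Summarize a clip with fixed-length handcrafted descriptors."""
    y, sr = librosa.load(path, sr=22050)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13).mean(axis=1)        # 13 values
    zcr = librosa.feature.zero_crossing_rate(y).mean()                     # 1 value
    contrast = librosa.feature.spectral_contrast(y=y, sr=sr).mean(axis=1)  # 7 values
    return np.hstack([mfcc, zcr, contrast])

# X = np.vstack([handcrafted_features(p) for p in wav_paths])  # wav_paths: your clips
# clf = SVC(kernel="rbf").fit(X, labels)                       # classical classifier
```

Because the descriptors are fixed in advance, pipelines of this kind cannot adapt their features to noisy, real-world soundscapes, which is the limitation the deep learning approaches below address.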
Deep learning models have marked a significant advancement in urban sound recognition by enabling automated feature extraction and substantially improving classification accuracy. CNNs have proven particularly effective for analyzing spectral features of audio data. By employing convolutional layers, CNNs capture local patterns in spectrograms, making them suitable for identifying intricate soundscapes. This capability is further enhanced by pooling layers, which reduce spatial dimensions and focus on the most relevant features, thereby improving computational efficiency. CNNs are well suited for handling large-scale datasets due to their hierarchical feature learning, enabling a robust classification of diverse sound categories [17].
LSTM networks, on the other hand, are designed for processing sequential data and excel in capturing temporal dependencies in acoustic signals. By using memory cells and gating mechanisms, LSTM networks can retain long-term dependencies while mitigating the vanishing gradient problem common in traditional recurrent neural networks. This makes LSTM networks particularly effective for recognizing patterns in time-series data, such as fluctuating sound intensities or rhythmic acoustic events [18,19]. However, their reliance on sequential processing can result in higher computational demands and slower inference times, particularly when applied to large datasets.
Dense neural networks, also referred to as fully connected networks, provide a flexible framework for sound classification by connecting each neuron in one layer to every neuron in the subsequent layer. This architecture allows Dense networks to model complex relationships between features, making them suitable for analyzing large datasets where intricate patterns and dependencies exist. However, this dense connectivity can lead to increased computational costs, requiring optimization techniques such as dropout regularization and batch normalization to prevent overfitting and enhance efficiency [20,21].
Despite their significant capabilities, deep learning models often require substantial computational resources. For example, CNNs and Dense networks rely heavily on GPU acceleration to process large-scale spectrograms, while LSTM models require substantial memory for maintaining temporal state information over long sequences. These demands often exceed the processing capacities of IoT devices, which are constrained by limited energy efficiency and computational power [22]. This limitation presents a critical challenge for deploying these models directly on IoT hardware in real-time urban sound recognition applications [23].
To address these constraints, fog computing provides a robust platform for integrating IoT and deep learning. This study evaluates the performance of CNN, LSTM, and Dense models within a fog computing framework, using the UrbanSound8K dataset, which contains 8732 recordings representing diverse urban sound categories. MFCCs were employed as the primary feature extraction method to capture key acoustic features, and simulations were conducted on the AWS platform to assess model performance using metrics such as precision, recall, macro F1-score, weighted F1-score, and confusion matrix. The results revealed that the CNN model achieved the highest accuracy (90%), followed by Dense (84%) and LSTM (81%) models. The CNN model’s superior performance is attributed to its ability to effectively capture spectral features, whereas the LSTM model exhibited limitations in distinguishing overlapping sound categories. Dense models, while effective, performed slightly below the CNN model in handling complex spectral characteristics.
In this study, although real-time data were not employed, the simulated fog computing environment was designed to provide a realistic assessment of model performance under conditions relevant to smart city applications. The findings demonstrate the feasibility of integrating fog computing with advanced deep learning models for scalable and efficient urban sound recognition. The proposed framework shows significant potential for real-time applications in environmental monitoring and public safety, underscoring the importance of decentralized architectures in future smart city systems.
The contributions of this study include a comprehensive evaluation of CNN, LSTM, and Dense models for urban sound classification within a fog computing framework, the validation of MFCC-based feature extraction for urban sound data, demonstrating its effectiveness in capturing complex acoustic features, and insights into fog computing’s ability to reduce latency and optimize computational efficiency for scalable smart city applications.
The structure of this paper is as follows: Section 2 provides an in-depth discussion of the IoT architecture and its significance in smart city applications, establishing the technical foundation for this study. Section 3 outlines the dataset and methodology, describing the data preprocessing steps and the architectures of the CNN, LSTM, and Dense models employed in the analysis. Section 4 presents the experimental results, including a comparative evaluation of model performance based on key metrics. Section 5 offers a detailed discussion of the findings, highlighting their implications for urban sound recognition systems and smart city implementations. Finally, Section 6 summarizes the key conclusions of the study and proposes directions for future research.

2. IoT and Smart City

2.1. Internet of Things (IoT)

IoT refers to a network architecture that enables physical objects to communicate with each other and with other systems over the internet. This technology is applied across various domains, from daily life to industrial applications, facilitating the development of more efficient, intelligent, and interconnected systems. The concept of IoT was first introduced in 1999 at the Auto-ID Laboratories of the Massachusetts Institute of Technology. In 2005, the International Telecommunication Union (ITU) provided an official definition in the “ITU Internet Report 2005: Internet of Things”. IoT can be defined as a system in which all objects interact with each other through technologies such as sensors and mobile devices to perform a common function [24]. Various definitions of IoT have been proposed in the literature. It is often described as a platform that enables seamless communication and appropriate information sharing among sensors and digital devices within an intelligent environment. It is also seen as a comprehensive system regulating data sharing, communication, and decision-making processes among physical objects [7,25].
IoT devices continuously communicate through wireless technologies such as Bluetooth, WiFi, ZigBee, WSN, LPWAN, and cellular networks. These devices are directly integrated with the physical world, receiving commands from computer-based systems to manage data and enhance living standards. It is anticipated that over 50 billion devices, including sensors, smartphones, laptops, and gaming consoles, will be connected to the internet via heterogeneous access networks enabled by technologies like radio-frequency identification (RFID) and wireless sensor networks [6]. An IoT system generally comprises devices, infrastructure, services, and applications, which are organized into several layers; as depicted in Figure 1, the architecture adopted here consists of the following five layers.
Interface Layer: Provides a user-friendly interface for interacting with the system, facilitating the presentation of information and enabling user interaction with IoT applications. It plays a crucial role in managing communication between the user and the underlying IoT infrastructure, ensuring a seamless user experience.
Service Layer: Manages and delivers services based on processed data, ensuring the smooth operation of IoT applications by handling tasks such as data analytics, resource allocation, and service orchestration.
Fog Layer: Positioned between the Network Layer and the Service Layer, the Fog Layer enables local data processing, preprocessing, and filtering near the data source. Raw data from IoT sensors are transmitted to fog nodes, where tasks such as noise filtering, initial feature extraction, and data compression are performed. These preprocessing tasks reduce the burden on centralized Cloud Servers, significantly minimize latency, and enable real-time responses for applications requiring immediate action. Additionally, the Fog Layer can host pre-trained machine learning models for tasks such as anomaly detection or noise classification, further enhancing the system’s responsiveness. This approach ensures efficient data handling, improves scalability, and reduces reliance on Cloud Servers by distributing the computational workload closer to edge devices [26].
Network Layer: Ensures connectivity between devices and the internet or other networks, facilitating data transmission via various wireless technologies such as Bluetooth, WiFi, ZigBee, and LPWAN [27].
Sensing Layer: Captures and collects data through smart sensors, RFID devices, and other IoT-enabled components, providing foundational real-world information that is transmitted to higher layers for processing [28].
By incorporating the Fog Layer into the traditional IoT architecture, this study aims to demonstrate how real-time data processing and local decision-making can be enhanced. The inclusion of machine learning algorithms at the fog nodes further supports the development of intelligent systems capable of handling large-scale data efficiently.
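To make the Fog Layer’s role concrete, the sketch below illustrates the kind of preprocessing a fog node might perform before forwarding data to the cloud: amplitude gating as a simple stand-in for noise filtering, followed by frame-level energy summarization as a form of data compression. This is illustrative only; the function, threshold, and frame length are assumptions, not details from the paper.

```python
import numpy as np

def fog_preprocess(raw_audio: np.ndarray, noise_floor: float = 0.01,
                   frame_len: int = 512) -> np.ndarray:
    """Gate low-amplitude samples, then forward per-frame RMS energy only."""
    gated = np.where(np.abs(raw_audio) < noise_floor, 0.0, raw_audio)
    n = (len(gated) // frame_len) * frame_len     # trim to whole frames
    frames = gated[:n].reshape(-1, frame_len)     # non-overlapping frames
    return np.sqrt((frames ** 2).mean(axis=1))    # one value per frame_len samples
```

Forwarding one energy value per 512 raw samples reduces the payload by orders of magnitude, which is the bandwidth saving the Fog Layer description above refers to.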

2.2. Smart City

Smart cities are described as an urban management model in which ICT is integrated to enhance the quality of life, ensure sustainability, and use resources efficiently. This model facilitates the digitalization of urban infrastructures, services, and management processes to create more liveable and sustainable cities. The importance of the smart city concept has been increasingly emphasized by the development of technologies such as big data, IoT, artificial intelligence, and cloud computing [30]. For successful implementation, however, the focus should not rest solely on technology but also on the societal and environmental impacts of these advancements.
In smart cities, technologies such as sensors, IoT devices, big data analytics, and cloud computing continuously collect and analyze data across various urban domains, with applications in areas such as traffic management systems, smart lighting, and energy management. The impact of these applications extends beyond the technical dimension, and attention should also be given to their long-term societal effects. Data collected by cities, ranging from traffic flow and air quality to energy consumption and water usage, are analyzed to improve the efficiency of management processes. Data analytics enhances decision-making and enables a more rapid resolution of issues, provided that the security and privacy of these data are ensured. Sustainability is one of the fundamental objectives of smart cities, and achieving this goal necessitates raising societal awareness [31].
In conclusion, smart cities represent a transformative shift in urbanism, driven by the integration of technology into urban life. Guided by the principles of sustainability, efficiency, and liveability, smart cities are not merely a vision but a tangible reality shaping the future of urbanism. The key components of smart cities, as outlined in the literature [2], are as follows:
Digital Infrastructure and Data Management: In smart cities, data are continuously collected and analyzed through sensors and other IoT devices. These data are utilized in areas such as traffic management, energy consumption, and environmental monitoring, thereby enhancing the efficiency of urban management [32]. However, attention must also be given to data privacy and security.
Smart Transportation and Mobility: Smart transportation systems optimize traffic flow, improve public transportation, and reduce carbon emissions. Real-time data and AI-based systems are employed to provide road users with the most efficient routes [3]. Additionally, the social acceptance and accessibility of these systems should be considered.
Sustainable Energy Management: In smart cities, renewable energy technologies and smart grid systems are integrated to manage energy resources efficiently. This approach reduces costs and mitigates environmental impacts. However, the scalability and sustainability of these systems must be assessed and supported to ensure the long-term viability of smart cities.
Smart Health and Education Services: The integration of digital technologies into health and education services not only enhances efficiency but also promises a future of more accessible and equitable services for citizens. Applications such as telemedicine and remote education are part of this transformation [33]. Preventing the digital divide and ensuring equitable access to these services are crucial to realizing the full potential of these advancements.
Community Participation and E-Governance: Smart cities are designed to empower citizens, increasing their role in decision-making processes. Through digital platforms, citizens can provide feedback and gain quicker access to urban services. This emphasis on community participation and transparency ensures that citizens are not merely beneficiaries but integral to the success of smart cities.
Figure 2 illustrates a smart campus application, which serves as a conceptual example for developing similar architectures in smart city applications [34]. This architecture, when adapted to smart cities, can be extended to incorporate a wider range of IoT devices and services, ensuring efficient real-time data processing and enhanced urban management through fog computing.

3. Material and Methodology

3.1. Data Set

This study utilizes the UrbanSound8K dataset, a publicly available resource designed specifically for urban sound classification tasks. The dataset consists of 8732 audio clips, each ranging in length from one to four seconds, providing a rich variety of urban soundscapes. These audio segments are categorized into 10 distinct classes: air conditioner, car horn, children playing, dog bark, drilling, engine idling, gunshot, jackhammer, siren, and street music. This variety ensures that the dataset encompasses a broad range of typical urban sounds, facilitating a thorough assessment of sound classification models.
The UrbanSound8K dataset offers a valuable testing ground for evaluating the performance of machine learning models in identifying diverse urban sounds. By including such varied sound categories, this dataset allows researchers to analyze how well a model can differentiate between common yet distinct urban auditory inputs, such as background noises (e.g., air conditioner and engine idling) and alerting sounds (e.g., siren and gunshot). This diversity is essential for testing the model’s capacity to handle real-world auditory challenges in urban environments.
To ensure a structured approach to training and evaluation, the dataset was split into training and testing subsets, with 80% of the data allocated for training and 20% reserved for testing. This partitioning supports a robust evaluation of model performance on unseen data, enhancing the reliability of classification results.
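A minimal sketch of this partition, assuming scikit-learn and the public UrbanSound8K metadata layout (a CSV with a class column), is shown below; the path is a placeholder for a local copy of the dataset, and stratification is assumed from the class-balance statements in Section 4.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

meta = pd.read_csv("UrbanSound8K/metadata/UrbanSound8K.csv")  # placeholder path
train_meta, test_meta = train_test_split(
    meta, test_size=0.20, stratify=meta["class"], random_state=0
)  # stratifying keeps the 10 class proportions similar in both subsets
```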

3.2. Methodology

This study utilizes the UrbanSound8K dataset to conduct urban sound classification, simulating a real-world urban sound recognition framework in the absence of real-time data. The proposed system is structured into three primary components: the End Device, the Fog Layer, and the Cloud Server (Figure 3). Initially, audio samples are processed using pre-recorded segments from the UrbanSound8K dataset, which are stored in WAV format to maintain consistency across the dataset. Several preprocessing techniques are applied on the End Device, including spectral gating for noise reduction, min–max normalization to standardize input data, and silence removal to isolate relevant sound patterns. Finally, MFCCs are extracted as key features to capture the spectral properties of the audio signals.
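A compact sketch of this End Device chain is given below. The paper does not name its libraries, so librosa is assumed for loading, trimming, and MFCC extraction, and the noisereduce package (a spectral-gating implementation) stands in for the noise-reduction step; the normalization range and trim threshold are illustrative choices.

```python
import librosa
import noisereduce as nr  # spectral gating; library choice is an assumption

def extract_mfcc(path: str, n_mfcc: int = 60):
    y, sr = librosa.load(path, sr=22050)           # load WAV clip as mono
    y = nr.reduce_noise(y=y, sr=sr)                # spectral gating for noise reduction
    y, _ = librosa.effects.trim(y, top_db=20)      # silence removal
    y = 2.0 * (y - y.min()) / (y.max() - y.min() + 1e-9) - 1.0  # min-max to [-1, 1]
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc).T    # (frames, n_mfcc)
```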
Subsequent to the preprocessing stage, the audio data are locally classified on the End Device using three distinct deep learning models: CNN, LSTM, and Dense networks. Each model operates independently to produce classification results. The Fog Layer functions as an intermediary component, facilitating data transmission and reducing latency by ensuring efficient communication between the End Device and the Cloud Server. In this architecture, the Fog Layer supports a low-latency communication environment by minimizing the time required for data exchange, thereby simulating real-time conditions.
The Cloud Server, deployed on AWS, is tasked with receiving and securely storing the classification results transmitted through the Fog Layer. This multi-layered architecture, comprising local data processing at the End Device, fog computing for optimized data transmission, and cloud computing for scalable storage, serves as a foundational framework for future real-time urban sound recognition systems.

4. Experimental Results

4.1. CNN Model Performance Results

This study used MFCCs for feature extraction, with the number of coefficients set to 60. This value was chosen based on preliminary testing to balance computational efficiency and the richness of extracted features, capturing essential sound characteristics without overwhelming the model. For the training and testing of the urban sound recognition system, a 5-fold cross-validation approach was adopted. Labels were first converted to numerical values using the LabelEncoder method and then one-hot encoded for use in the training process. Within each fold, the data were divided into 80% training and 20% validation subsets, ensuring that class balance was maintained throughout the process. A fixed seed (random_state = 0) was used to guarantee reproducibility, and a balanced distribution of samples across the different classes was ensured in both the training and validation sets.
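The label handling and fold construction described above can be reproduced with standard scikit-learn and Keras utilities, as in the sketch below. The label array is a tiny placeholder, and StratifiedKFold (which the paper names for the LSTM and Dense experiments) is assumed here as the mechanism that maintains class balance.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import StratifiedKFold
from tensorflow.keras.utils import to_categorical

labels = np.array(["siren", "dog_bark", "car_horn", "drilling", "street_music"] * 5)
y_int = LabelEncoder().fit_transform(labels)  # class names -> integer ids
y_onehot = to_categorical(y_int)              # integer ids -> one-hot vectors

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, val_idx in skf.split(np.zeros(len(y_int)), y_int):
    # Each fold yields roughly an 80/20 train/validation division with the
    # class proportions preserved in both subsets.
    y_train, y_val = y_onehot[train_idx], y_onehot[val_idx]
```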
The CNN model architecture consisted of an initial convolutional layer with 32 filters of size 5 × 5 to capture local spatial patterns in the mel-spectrograms, followed by a ReLU activation function to introduce non-linearity and a max pooling layer of size 2 × 2 to reduce dimensionality. These hyperparameters were selected based on common values in the literature and validated by experimentation. The feature maps were then flattened into a one-dimensional vector, followed by a Dense layer with 128 neurons to fully connect the network and enhance the model’s ability to learn complex sound features. To prevent overfitting, dropout layers with a dropout rate of 0.3 were applied, ensuring that the model does not learn noise or irrelevant patterns.
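Under this description, the CNN can be assembled as in the following Keras sketch. The layer sequence and hyperparameters (32 filters, 5 × 5 kernel, 2 × 2 pooling, 128-unit dense layer, 0.3 dropout) follow the text; the input shape, with 60 MFCC coefficients over 174 frames, is an illustrative assumption, as the paper does not state the frame count.

```python
from tensorflow.keras import layers, models

def build_cnn(filters=32, kernel=(5, 5), dropout=0.3,
              input_shape=(60, 174, 1), n_classes=10):
    # Conv -> ReLU -> 2x2 max pooling, then flatten into a 128-unit dense layer.
    model = models.Sequential([
        layers.Input(shape=input_shape),
        layers.Conv2D(filters, kernel, activation="relu"),
        layers.MaxPooling2D((2, 2)),
        layers.Flatten(),
        layers.Dense(128, activation="relu"),
        layers.Dropout(dropout),  # regularization against overfitting
        layers.Dense(n_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam", loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model

model = build_cnn()
```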
Hyperparameters were determined according to the following criteria: the number of filters (32) was considered sufficient to capture essential features, and the kernel size (5 × 5) was found to be suitable for identifying local features in the sound spectrogram. The number of epochs (50) was selected based on prior experimentation to ensure adequate training without overfitting.
During the grid search process, the following candidate values for key hyperparameters were evaluated: filter count in the range of [16, 32, 64], kernel sizes of [3 × 3, 5 × 5], and dropout rates of [0.3, 0.5, 0.7]. The optimal configuration—32 filters, a 5 × 5 kernel, and a 0.3 dropout rate—was selected based on achieving the highest validation accuracy. This approach ensured that the selected hyperparameters provided a balance between model complexity and generalization performance.
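A manual loop over the stated candidate grid, reusing the build_cnn sketch above, might look as follows. The training and validation arrays here are tiny random placeholders so the sketch is self-contained; in practice they would be one fold's MFCC tensors and one-hot labels, and the paper does not specify its search tooling.

```python
import itertools
import numpy as np

# Random placeholders; substitute real fold data.
X_train = np.random.rand(32, 60, 174, 1)
y_train = np.eye(10)[np.random.randint(0, 10, 32)]
X_val = np.random.rand(8, 60, 174, 1)
y_val = np.eye(10)[np.random.randint(0, 10, 8)]

best_cfg, best_acc = None, 0.0
for filters, kernel, dropout in itertools.product(
        [16, 32, 64], [(3, 3), (5, 5)], [0.3, 0.5, 0.7]):
    model = build_cnn(filters=filters, kernel=kernel, dropout=dropout)
    history = model.fit(X_train, y_train, epochs=50, batch_size=32,
                        validation_data=(X_val, y_val), verbose=0)
    val_acc = max(history.history["val_accuracy"])
    if val_acc > best_acc:  # keep the configuration with the best validation accuracy
        best_cfg, best_acc = (filters, kernel, dropout), val_acc
```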
During training, the CNN model’s accuracy started at 31.4% and gradually improved to 94.1% as the model learned to identify more intricate sound patterns. This improvement reflects the iterative learning process, in which each epoch allowed the model to adjust its weights based on error minimization. Validation accuracy followed a similar trend, rising from 52.7% to 90.9%, which demonstrates that the model generalized well as training progressed. Training loss decreased from 2.27 to 0.16, while validation loss fell from 1.42 to 0.38, indicating that the model learned effectively while avoiding overfitting due to the regularization techniques employed.
As a result, the model trained with the selected hyperparameters achieved an accuracy of 90% on the test data. These results demonstrate that the chosen hyperparameters were suitable for the problem space and that the model performed effectively overall.
As shown in Table 1, in individual sound categories, the model exhibited superior performance in recognizing distinct sounds such as car horn, engine idling, siren, and gunshot, with accuracy and F1-scores exceeding 95% for car horn and siren. These results suggest that these sounds have unique acoustic features, making them more easily distinguishable by the model. However, the accuracy and F1-scores for categories like children playing and dog bark ranged between 81% and 84%, indicating slightly lower performance. This discrepancy likely stems from overlapping frequencies or similar acoustic patterns with other sound types, which could lead to occasional misclassification. This observation is consistent with findings in the literature, where complex or less distinct sounds can be challenging for machine learning models.
Overall, the CNN model’s results were competitive compared to similar studies in urban sound recognition, showing that the chosen architecture and preprocessing steps were effective. The consistent macro and weighted F1-scores highlight the model’s ability to perform uniformly across diverse categories, contributing to the broader goal of balanced classification in urban sound environments. These results demonstrate the potential of the CNN model within a fog computing architecture for efficient sound recognition in smart cities, though further testing with real-time data could provide additional insights into the system’s practical applicability.
The confusion matrix shown in Figure 4 reveals the model’s tendencies to confuse different sound categories. Categories such as air conditioner, car horn, engine idling, and jackhammer were generally classified correctly.
However, some confusion was observed between children playing and dog bark, children playing and drilling, and children playing and street music. Particularly, the children playing category was confused with several different sounds. Although categories like gunshot, siren, and drilling were mostly classified correctly, occasional misclassifications did occur. These results indicate that while the model performs with high accuracy overall, there is some confusion between similar sounds. The model’s performance is generally positive, as indicated by the high correct classification rates and relatively low confusion rates.
Overall, the model was observed to correctly classify most sound categories, although some confusion was noted, particularly in the “children playing” and “drilling” categories. This suggests that certain sounds may share similar features, making it challenging for the model to differentiate them. The model’s overall performance is strong, as evidenced by high accuracy rates and minimal confusion.

4.2. LSTM Model Performance Results

The analysis process began with the extraction of MFCC features, using the n_mfcc = 40 parameter. This value was chosen to represent the spectral characteristics of the sound signals with sufficient resolution and is commonly preferred in the literature. Alternative n_mfcc values (e.g., 20, 30, and 50) were tested, and 40 was found to provide an optimal balance between feature richness and computational efficiency. Three-second segments were extracted from the audio files with a 0.5 s offset; these parameters were found to provide an appropriate temporal resolution for capturing the characteristic features of the urban sound events. The resulting MFCC matrices were converted into a format suitable for the LSTM model and standardized using the pad_sequences_2d function. The padding strategy was set to “post”, ensuring that critical information at the beginning of the feature sequences was preserved.
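The segmentation and padding steps can be sketched as follows. pad_sequences_2d appears to be a project-specific utility, so Keras’s pad_sequences (which also handles sequences of feature vectors) is used as a stand-in; the file name and the 130-frame target length are illustrative assumptions.

```python
import librosa
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Load a 3 s segment starting 0.5 s into the clip ("clip.wav" is a placeholder).
y, sr = librosa.load("clip.wav", sr=22050, offset=0.5, duration=3.0)
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=40).T  # shape: (frames, 40)

# Pad (or truncate) the frame axis to a fixed length; padding="post" keeps the
# start of the sequence intact, as described in the text.
X = pad_sequences([mfcc], maxlen=130, padding="post",
                  truncating="post", dtype="float32")
```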
The dataset was processed using a 5-fold cross-validation approach, employing the StratifiedKFold class to ensure balanced class distributions across folds. Labels were converted to numerical values using the LabelEncoder method and one-hot encoded for training. An 80–20% training-validation ratio was maintained within each fold to ensure a proportional representation of classes.
The LSTM model consisted of three layers, with 128 neurons in the first layer, 256 neurons in the second layer, and 128 neurons in the third layer. A relatively high dropout rate of 50% was applied after each LSTM layer to prevent overfitting as initial experiments indicated that lower dropout rates (e.g., 20–30%) resulted in insufficient regularization. The return_sequences = True parameter was used in all LSTM layers to preserve temporal information for subsequent layers. The softmax activation function was used in the output layer for multi-class classification.
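A Keras rendering of this architecture is sketched below. Because return_sequences = True is kept in the final LSTM layer, the time dimension must still be collapsed before the softmax output; the paper does not state how this is done, so Flatten is used here as one plausible choice, and the input shape is assumed.

```python
from tensorflow.keras import layers, models

def build_lstm(input_shape=(130, 40), n_classes=10):  # (frames, MFCCs); frame count assumed
    model = models.Sequential([
        layers.Input(shape=input_shape),
        layers.LSTM(128, return_sequences=True),
        layers.Dropout(0.5),
        layers.LSTM(256, return_sequences=True),
        layers.Dropout(0.5),
        layers.LSTM(128, return_sequences=True),
        layers.Dropout(0.5),
        layers.Flatten(),  # collapse the preserved time dimension (assumed step)
        layers.Dense(n_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam", loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model

model = build_lstm()
# model.fit(X_train, y_train, epochs=100, batch_size=32,
#           validation_data=(X_val, y_val))  # no early stopping, per the text below
```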
During training, the Adam optimization algorithm was used due to its adaptive learning rate strategy, which allows for efficient convergence and effective handling of gradient issues. The batch size was set to 32, which provided adequate generalization performance while efficiently utilizing GPU memory. The number of epochs was set to 100, based on the convergence behavior of the training and validation curves: fewer epochs resulted in insufficient learning, while training beyond 100 epochs significantly raised computational costs without notable performance gains. Early stopping was not employed; instead, the model was trained uninterrupted for the full number of epochs, under the assumption that this would allow it to fully learn the patterns in the UrbanSound8K dataset without premature halting. Compared to similar studies on urban sound recognition using the UrbanSound8K dataset, where typical LSTM models achieve between 78% and 83% accuracy, the achieved 81% accuracy aligns closely with benchmark results in the literature, validating the effectiveness of the model’s architecture and training strategy [36].
As shown in Table 2, when examining individual categories, the “engine idling” category stood out with the highest accuracy (0.94) and F1-score (0.90), demonstrating the model’s proficiency in distinguishing engine idling sounds from other noises. Other categories such as “air conditioner”, “jackhammer”, and “siren” also exhibited relatively high performance, attributed to their distinct acoustic features, which are easier for the model to differentiate. However, the model’s performance in the “children playing” (F1-score 0.70) and “dog bark” (F1-score 0.68) categories was relatively lower. This suggests that these sounds have overlapping frequencies or similar acoustic patterns with other categories, which likely contributed to occasional misclassification. This observation is consistent with findings in the literature, where complex or less distinct sounds have been shown to challenge machine learning models due to their resemblance to other sound classes.
In summary, the LSTM model’s results were competitive compared to similar studies in urban sound recognition, indicating that the chosen architecture and preprocessing steps were effective. The balanced macro and weighted F1-scores highlight the model’s ability to perform uniformly across diverse categories, contributing to the broader goal of balanced classification in urban sound environments. These results demonstrate the potential of the LSTM model within a fog computing architecture for efficient sound recognition in smart cities, though further testing with real-time data could provide additional insights into the system’s practical applicability.
Figure 5 illustrates the confusion matrix of LSTM model’s performance on the UrbanSound8K dataset, revealing that the model achieved high accuracy across most sound categories. Air conditioner, car horn, engine idling, and jackhammer sounds were generally classified correctly. However, some confusion was observed in categories such as children playing, dog bark, and street music. For instance, the children playing category was sometimes confused with dog bark and drilling sounds. Although the gunshot category was generally classified correctly, there were rare instances of confusion with children playing and jackhammer sounds. Overall, the LSTM model demonstrated effective performance in recognizing urban sounds, though confusion may occur between sounds with similar characteristics.

4.3. Dense Model Performance Results

The analysis process began with the extraction of MFCC features, using the n_mfcc = 60 parameter. This value was chosen based on its ability to balance computational efficiency and feature richness, ensuring that sound signals’ spectral characteristics were adequately represented. Three-second segments were extracted from the audio files with a 0.5 s offset. These values were determined through preliminary experimentation, ensuring appropriate temporal resolution for capturing key sound patterns. The resulting MFCC matrices were converted into a format suitable for the Dense model using the pad_sequences_2d function and standardized to a fixed length. The padding strategy was set to “post” to preserve critical information at the beginning of the feature sequences. The MFCC feature extraction time for the Dense model was measured as 66.12 s.
The dataset was divided into training and test sets using an 80–20% split, which is commonly adopted in the literature. Experiments with different ratios (70–30%, 75–25%, and 85–15%) showed that the 80–20% split provided the best balance between model learning capacity and evaluation reliability. A 5-fold cross-validation approach was adopted, and the StratifiedKFold method was used with random_state = 0 to ensure reproducibility and balanced class distributions across folds. Labels were converted to numerical values using the LabelEncoder method and one-hot encoded for training. Stratified splitting was applied within each fold to maintain class balance.
The Dense model consisted of three layers, with 125 neurons in the first layer, 250 in the second, and 125 in the third. This architecture was selected to enhance the model’s capacity for learning complex sound patterns while maintaining reasonable training times. A dropout rate of 50% was applied after each layer to mitigate overfitting; this relatively high rate was selected based on initial tests, in which lower dropout rates (20–30%) proved insufficient. The ReLU activation function was used in the hidden layers, chosen for its ability to mitigate the vanishing gradient problem and improve convergence speed, and the softmax activation function was applied in the output layer for multi-class classification. The model was trained using the Adam optimizer, known for its adaptive learning rate and efficient convergence properties, with a batch size of 32 to balance computational efficiency and generalization performance. The number of epochs was set to 100, based on convergence analysis of the training and validation loss curves, ensuring sufficient learning without excessive computational cost; early stopping was therefore not employed.
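The corresponding Keras sketch is shown below; the flattened input size (60 coefficients over an assumed 130 frames) is illustrative, as the paper does not state the exact input dimensionality.

```python
from tensorflow.keras import layers, models

def build_dense(input_dim=60 * 130, n_classes=10):  # flattened MFCC matrix; frame count assumed
    model = models.Sequential([
        layers.Input(shape=(input_dim,)),
        layers.Dense(125, activation="relu"),
        layers.Dropout(0.5),
        layers.Dense(250, activation="relu"),
        layers.Dropout(0.5),
        layers.Dense(125, activation="relu"),
        layers.Dropout(0.5),
        layers.Dense(n_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam", loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model

model = build_dense()
# model.fit(X_train_flat, y_train, epochs=100, batch_size=32)
```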
The training accuracy of the Dense model initially started at a low value of 12.5% but progressively increased to 80.6% as training advanced. This initial low accuracy can be attributed to the complexity of the dataset and the presence of multiple distinct sound categories, which challenged the model’s ability to learn accurate representations early in the training process. However, as the model continued to train, it gradually improved its accuracy by learning the intricate patterns within the dataset. Validation accuracy similarly exhibited an initial low point of 12.2%, eventually increasing to 85.6%, reflecting improved generalization capabilities over time. Training loss decreased from 10.0883 to 0.5895, while validation loss fell from 2.2881 to 0.5091, indicating that the model successfully learned to make more accurate predictions while avoiding overfitting due to the regularization techniques applied.
Table 3 presents a detailed performance evaluation of the Dense model on the UrbanSound8K dataset. Overall accuracy on the test set was calculated at 84%, with a macro average accuracy of 86% and an F1-score of 84%. These metrics were chosen to evaluate the model’s balanced performance across all classes, considering the potential class imbalances within the dataset. The macro average metric provides an equal-weighted performance evaluation across classes, while the weighted average accounts for class imbalances, thus offering a more realistic assessment of the model’s general performance.
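Both averages are available directly in scikit-learn, as the short sketch below shows; y_true and y_pred are placeholders for the integer test labels and model predictions.

```python
from sklearn.metrics import classification_report, f1_score

y_true = [0, 1, 2, 2, 1, 0]  # placeholder labels
y_pred = [0, 1, 2, 1, 1, 0]  # placeholder predictions

print(classification_report(y_true, y_pred))                # per-class precision/recall/F1
macro_f1 = f1_score(y_true, y_pred, average="macro")        # each class weighted equally
weighted_f1 = f1_score(y_true, y_pred, average="weighted")  # weighted by class support
```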
According to the results, the model exhibited high accuracy and F1-scores in the “car horn” (91% F1-score) and “siren” (93% F1-score) categories. These high scores suggest that the model was proficient in recognizing these sounds, likely due to their distinct acoustic features. In particular, the “car horn” category achieved a precision of 97%, and the “siren” category achieved a recall of 90%, underscoring the model’s ability to differentiate these categories effectively. The unique and high-frequency features of these sounds may have facilitated the model’s ability to learn and classify them accurately.
In contrast, the model’s performance was lower in the “children playing” (68% F1-score) and “gunshot” (76% F1-score) categories. The “gunshot” category, in particular, exhibited a low recall of 63%, suggesting that the model had difficulty distinguishing this sound from other categories. This low recall may be due to the overlap in acoustic features between gunshot sounds and other sound types within the dataset. The relatively lower performance in the “children playing” category may be attributed to overlapping frequencies with other sounds, making it challenging for the model to differentiate this category. Similar difficulties in classifying these types of sounds have been reported in the literature, suggesting that these challenges are inherent to the dataset and the acoustic similarity of certain sounds.
Overall, the Dense model’s results are comparable to those reported in similar studies on urban sound classification, indicating a competitive accuracy level within the range of 82–85% commonly observed for Dense models on the UrbanSound8K dataset. Thus, the model’s performance aligns with expectations based on the existing literature, further validating the effectiveness of the chosen architecture and preprocessing steps. While the model achieved high performance in certain categories, it showed lower accuracy and F1-scores in some classes, which can likely be attributed to the presence of overlapping acoustic features between sound categories.
These findings demonstrate the Dense model’s potential to provide balanced classification performance across diverse sound categories within a fog computing architecture for urban sound recognition applications. However, further testing with real-time data could offer additional insights into the practical applicability of this system in real-world settings.
Figure 6 displays the confusion matrix for the Dense model’s performance on the UrbanSound8K dataset. The model generally classified sounds such as air conditioner, car horn, engine idling, and jackhammer correctly. However, noticeable confusion was observed in the categories of children playing, dog bark, and street music. For example, the children playing category was often confused with air conditioner and dog bark. Categories such as drilling and gunshot were mostly classified correctly, although there was occasional confusion with other categories.
Critically evaluating the confusion matrix, the model shows strength in identifying distinct sounds like car horn and siren, likely due to their unique acoustic features. However, the confusion in categories such as children playing and dog bark suggests that these sounds share overlapping characteristics, which complicates their classification. The lower recall in the gunshot category, despite a high precision score, indicates that while the model is good at identifying gunshots when they occur, it misses a significant number of gunshots, reflecting a need for further refinement in this area.
Overall, the Dense model performed well in classifying urban sounds within the UrbanSound8K dataset. Most categories were classified correctly, with some minor confusion. Notably, the “children playing” category exhibited more confusion with other categories. These confusions indicate that certain sounds may share similar characteristics with others, making them more challenging for the model to distinguish. While the model’s overall performance is positive, further improvements could be made to enhance accuracy in specific categories.

5. Discussion

This study aimed to evaluate the performance of three deep learning models—CNN, LSTM, and Dense—under an IoT architecture for recognizing and processing urban sounds. IoT studies typically rely on a classical four-layer architecture comprising sensing, network, service, and application layers. However, this study proposes an innovative data processing model to minimize certain issues inherent in the classical architecture, such as data latency and processing delays, and to contribute to the development of more efficient IoT-based sensing systems. The proposed model incorporates fog computing to perform real-time data processing between the Sensing and Network Layers, ensuring that data are processed closer to their origin, thereby reducing the need for extensive transmission to a central server before analysis.
The research focused on urban sound recognition using the UrbanSound8K dataset, which includes a diverse range of urban sounds such as traffic noise, sirens, human voices, and construction noises. Each model was trained and tested to classify these sounds in a setup designed to simulate real-time data from the Sensing Layer. Performance evaluation was conducted using metrics such as accuracy, precision, recall, and F1-score, ensuring that the substantial volume of data from the IoT Sensing Layer was processed and interpreted effectively. The performance of the CNN, LSTM, and Dense models was evaluated across various sound categories to provide a comprehensive comparison. Table 4 presents the results of this evaluation, summarizing the accuracy, precision, recall, and F1-score metrics for each model on the UrbanSound8K dataset. This comparison highlights the strengths and limitations of each model in classifying diverse urban sounds, offering valuable insights into their suitability for real-time urban sound recognition within an IoT-enabled fog computing framework.
The results demonstrated that the CNN model achieved the highest overall accuracy at 90%. This finding aligns with previous studies that highlight the CNN models’ superior performance in sound recognition tasks, especially when spectral representations like spectrograms and MFCCs are used, which preserve important spatial features for classification [12,37]. Boddapati et al. (2017) also found that CNN models excel in environmental sound classification, particularly when using spectral images and MFCC features. The findings of this study strongly support the conclusion that CNN models are highly effective in both image and sound recognition due to their capacity to capture spatial hierarchies within data, which is essential for distinguishing between different sound categories in urban environments [38]. This conclusion is further supported by recent work by Costa et al. (2017) on music classification, which demonstrated CNN models’ effectiveness in capturing complex auditory features applicable to urban sound recognition [39]. Similarly, 2023 studies have validated CNN models’ applicability in urban environments, particularly when deployed on fog nodes to reduce latency and improve real-time capabilities in dense urban settings (e.g., traffic prediction and urban noise monitoring) [40,41]. Other studies have similarly supported the high performance of CNN models. For example, Giannakopoulos and Pikrakis (2014) underscored CNN models’ role in sound analysis, illustrating the model’s utility for extracting detailed features from audio data [42]. Previous studies have demonstrated the effectiveness of CNN models in environmental sound classification. For instance, it was shown that a decision-level fusion-based two-stream CNN model achieved notable success in this domain [43], while deep CNN models enhanced with mixup techniques were reported to improve performance in classification tasks [44]. Building on these findings, recent advancements in CNN architectures, such as those presented by Su et al. (2023) and Zhang et al. (2023), have further expanded the potential of CNNs in sound analysis by integrating fusion-based approaches and augmentation techniques. These advancements highlight the growing applicability of CNNs in sound analysis, aligning with and supporting the findings obtained in this study, thereby validating the robustness of the proposed methodology [45,46].
In addition to these general findings, CNN models have been specifically employed for urban sound classification using the UrbanSound8K dataset. In one study, data augmentation techniques, including pitch shifting, time stretching, and SpecAugment, were applied on a five-layer CNN model. This study, conducted in 2019, achieved its highest accuracy of 79.2% and an F1-score of 80.3% when SpecAugment and time-stretching techniques were combined [47]. In another study from 2021, a deep one-dimensional CNN model was proposed, where MFCCs were used for feature extraction. The model achieved an accuracy of 84.02%, indicating its capability to effectively classify diverse urban sounds [48]. More recently, in 2022, a different approach was taken by directly learning features from raw waveforms using a deeper CNN architecture. This method achieved a notable accuracy of 89%, demonstrating the potential of end-to-end learning approaches in urban sound classification [49].
Compared to these studies, the CNN model in the present work demonstrated superior performance with an accuracy of 90% and balanced macro and weighted F1-scores of 90% without employing advanced augmentation techniques. This result suggests that the architecture and preprocessing approach proposed here, particularly the use of MFCC-based feature extraction and the fog computing framework, contribute significantly to improving urban sound classification. While previous studies have shown that augmentation techniques can enhance model robustness, this study indicates that a well-structured CNN model with efficient preprocessing can achieve higher performance without extensive augmentation. Nonetheless, integrating augmentation methods could further improve the generalization and robustness of the model in real-world scenarios where varying acoustic conditions are encountered [50].
The LSTM model, while achieving a respectable overall accuracy of 81%, demonstrated lower performance in specific sound categories, such as “children playing” and “dog bark”. In our study, the F1-scores for these categories were 70% and 68%, respectively, which are significantly lower than other categories such as “engine idling” (90%) and “jackhammer” (88%). This outcome aligns with findings in the literature, which indicate that LSTM models may face challenges in recognizing sounds that lack clear temporal structure or overlap with other categories. A similar study using the UrbanSound8K dataset reported an accuracy of 79% for LSTM models, further supporting the observation that LSTM models, while advantageous for sequence-based data, may struggle with less distinct or overlapping temporal patterns [51]. Additionally, it was noted in a 2023 study that while LSTM models are effective in handling sequential data, they were outperformed by Dense architectures in tasks involving non-sequential or complex overlapping sounds, suggesting that LSTM models may require additional fine-tuning or hybrid approaches to improve their performance in urban sound classification [52].
The Dense model achieved an overall accuracy of 84%, with balanced F1-scores across categories, such as 89% for “engine idling” and 93% for “siren”. Despite performing better than the LSTM model, it did not reach the accuracy level of the CNN model, which achieved 90% accuracy. In comparison, a recent study using the same UrbanSound8K dataset reported a Dense model accuracy of 84%, which is consistent with the performance observed. This study emphasized that Dense models, while effective in capturing complex feature interactions, may face limitations in tasks requiring spatial feature extraction, particularly when spectrogram-based inputs are used [53]. The slight discrepancy in performance compared to the CNN model can be attributed to the Dense model’s architecture, which lacks convolutional layers that are critical for capturing local spatial features in spectrograms. Moreover, while the Dense model showed robustness in classifying distinct sounds, it exhibited slightly lower performance in overlapping categories, such as “children playing” and “dog bark”, with F1-scores of 68% and 76%, respectively.
These results suggest that deep learning models are reliable and effective tools for urban sound recognition applications. Integrating deep learning models with fog computing, as demonstrated in this research, offers significant advantages by enabling data processing closer to the source. This approach reduces latency and minimizes reliance on centralized data centers, which is essential for urban sound recognition applications requiring real-time data processing [36]. Unlike studies that primarily focus on more specific or limited datasets, this research highlights the adaptability of deep learning models in handling diverse and complex sound data, thereby broadening their applicability across various urban environments [54]. The impact of this integration extends beyond urban sound recognition, offering significant potential for other IoT applications in smart cities, such as environmental monitoring, security systems, and emergency management [55]. By enabling fast and reliable data processing, fog computing enhances the efficiency and sustainability of smart cities, presenting a promising solution to address the increasing data demands of modern urban environments. As fog computing technologies continue to advance, the performance and widespread adoption of these applications are expected to improve, further supporting the development of responsive, AI-driven urban management systems.

6. Conclusions

This study focused on evaluating CNN, LSTM, and Dense models for urban sound recognition using the UrbanSound8K dataset, aiming to develop an IoT–fog computing framework for real-time applications. The findings demonstrate that the CNN model exhibited the highest overall accuracy at 90%, particularly excelling in sound categories with distinct spectral features, such as car horns and sirens, due to its superior ability to capture local patterns in spectrograms. The Dense model followed with an accuracy of 84%, offering balanced performance across categories but slightly lagging behind the CNN model in terms of handling complex acoustic features.
In contrast, while the LSTM model achieved 81% accuracy, it faced challenges in recognizing transient and overlapping sounds, such as gunshots and children playing, where temporal dependencies were less pronounced. These results highlight that although the LSTM model is effective in processing sequential data, it may require additional fine-tuning or hybrid approaches for improved performance in such scenarios.
The integration of these models with a fog computing framework aims to reduce latency, minimize bandwidth usage, and enhance real-time data processing efficiency, making it suitable for smart city environments. However, further research is needed to improve performance in overlapping sound categories and validate the system under real-time conditions.
Future research will focus on developing hybrid models that combine the strengths of the CNN and LSTM models, exploring advanced feature extraction techniques, and deploying the system in real-world environments to evaluate its scalability and robustness.

Funding

This research received no funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this study are available on request from the corresponding author.

Conflicts of Interest

The author declares no conflicts of interest.

References

  1. United Nations. United Nations Department of Economic and Social Affairs; United Nations: New York, NY, USA, 2018. [Google Scholar]
  2. Al-Turjman, F.; Zahmatkesh, H.; Shahroze, R. An overview of security and privacy in smart cities’ IoT communications. Trans. Emerg. Telecommun. Technol. 2019, 33, e3677. [Google Scholar] [CrossRef]
  3. Bibri, S.E.; Krogstie, J. The emerging data-driven Smart City and its innovative applied solutions for sustainability: The cases of London and Barcelona. Energy Inform. 2020, 3, 5. [Google Scholar] [CrossRef]
  4. Jasim, N.A.; Alrikabi, H.T.S. Design and Implementation of Smart City Applications Based on the Internet of Things. Int. J. Interact. Mob. Technol. 2021, 15, 4–15. [Google Scholar]
  5. Rana, O.; Theodorou, M.; Zhao, L. Scalable real-time urban sound classification using fog computing. J. Parallel Distrib. Comput. 2019, 132, 62–72. [Google Scholar]
  6. Atzori, L.; Iera, A.; Morabito, G. The internet of things: A survey. Comput. Netw. 2010, 54, 2787–2805. [Google Scholar] [CrossRef]
  7. Al-Fuqaha, A.; Guizani, M.; Mohammadi, M.; Aledhari, M.; Ayyash, M. Internet of things: A survey on enabling technologies, protocols, and applications. IEEE Commun. Surv. Tutor. 2015, 17, 2347–2376. [Google Scholar] [CrossRef]
  8. Chataut, R.; Phoummalayvane, A.; Akl, R. Unleashing the power of IoT: A comprehensive review of IoT applications and future prospects in healthcare, agriculture, smart homes, smart cities, and industry 4.0. Sensors 2023, 23, 7194. [Google Scholar] [CrossRef] [PubMed]
  9. AlJamal, M.; Mughaid, A.; Bani-Salameh, H.; Alzubi, S.; Abualigah, L. Optimizing risk mitigation: A simulation-based model for detecting fake IoT clients in smart city environments. Sustain. Comput. Inform. Syst. 2024, 43, 101019. [Google Scholar] [CrossRef]
  10. Bonomi, F.; Milito, R.; Zhu, J.; Addepalli, S. Fog computing and its role in the internet of things. In Proceedings of the First Edition of the MCC Workshop on Mobile Cloud Computing, New York, NY, USA, 17 August 2012; pp. 13–16. [Google Scholar]
  11. Tan, E.L.; Karnapi, F.A.; Ng, L.J.; Ooi, K.; Gan, W.S. Extracting urban sound information for residential areas in smart cities using an end-to-end IoT system. IEEE Internet Things J. 2021, 8, 14308–14321. [Google Scholar] [CrossRef]
  12. Piczak, K.J. Environmental sound classification with convolutional neural networks. In Proceedings of the 2015 IEEE 25th International Workshop on Machine Learning for Signal Processing (MLSP), Boston, MA, USA, 17–20 September 2015; pp. 1–6. [Google Scholar]
  13. Zhang, C.; Liu, H.; Chen, Z. Distributed edge AI for urban sound recognition in smart cities. IEEE Internet Things J. 2022, 9, 835–846. [Google Scholar]
  14. Mahmud, R.; Kotagiri, R.; Buyya, R. Fog computing: A taxonomy, survey and future directions. In Internet of Everything; Springer: Berlin/Heidelberg, Germany, 2020; pp. 103–130. [Google Scholar]
  15. Zhao, Z.; Peng, Y.; Chen, Y.; Hu, Y. Energy-efficient fog computing for real-time speech recognition in IoT systems. IEEE Access 2018, 6, 31900–31911. [Google Scholar]
  16. Zhang, W.; Li, H.; Wang, X. Distributed fog computing for speech recognition with privacy protection in IoT. IEEE Commun. Mag. 2017, 55, 125–131. [Google Scholar]
  17. Zhao, G.; Pang, B.; Xu, Z.; Peng, D.; Zuo, D. Urban flood susceptibility assessment based on convolutional neural networks. J. Hydrol. 2020, 590, 125235. [Google Scholar] [CrossRef]
  18. Graves, A. Generating sequences with recurrent neural networks. arXiv 2013, arXiv:1308.0850. [Google Scholar]
  19. Sainath, T.N.; Weiss, R.J.; Senior, A.W.; Wilson, K.W.; Vinyals, O. Learning the speech front-end with raw waveform CLDNNs. In Interspeech; Google, Inc.: New York, NY, USA, 2015; pp. 1–5. [Google Scholar]
  20. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  21. Huang, Z.; Liu, C.; Fei, H.; Li, W.; Yu, J.; Cao, Y. Urban sound classification based on 2-order dense convolutional network using dual features. Appl. Acoust. 2020, 164, 107243. [Google Scholar] [CrossRef]
  22. Alotaibi, S.; Khan, M.K. Energy-efficient IoT-based speech recognition system using fog computing in smart homes. Sustain. Cities Soc. 2020, 55, 102045. [Google Scholar]
  23. Wang, J.; Wang, Z.; Zhang, P. Real-time speech recognition using fog computing for smart homes. J. Cloud Comput. 2021, 10, 1–13. [Google Scholar]
  24. Turgut, Z. Mobility Management for the Internet of Things. Ph.D. Thesis, Computer Engineering Programme, Department of Computer Engineering, Institute of Science, Istanbul University, Istanbul, Türkiye, 2018. [Google Scholar]
  25. Marjani, M.; Nasaruddin, F.; Gani, A.; Karim, A.; Hashem, I.A.T.; Siddiqa, A.; Yaqoob, I. Big IoT data analytics: Architecture, opportunities, and open research challenges. IEEE Access 2017, 5, 5247–5261. [Google Scholar]
  26. Xhaferra, E.; Ismaili, F.; Cina, E.; Mitre, A. A conceptual framework for leveraging cloud and fog computing in diabetes prediction via machine learning algorithms: A proposed implementation. J. Theor. Appl. Inf. Technol. 2024, 102, 6004–6026. [Google Scholar]
  27. Kaya, Ş.M.; Erdem, A.; Güneş, A. A smart data pre-processing approach to effective management of big health data in IoT edge. Smart Homecare Technol. TeleHealth 2021, 8, 9–21. [Google Scholar] [CrossRef]
  28. Li, S.; Choo, K.K.R.; Sun, Q.; Buchanan, W.J.; Cao, J. IoT forensics: Amazon echo as a use case. IEEE Internet Things J. 2019, 6, 6487–6497. [Google Scholar] [CrossRef]
  29. Sun, Y.; Wu, X.; Zhou, Q.; Yu, R. Fog computing and its applications for Internet of Things: A review. IEEE Access 2021, 9, 11734–11745. [Google Scholar]
  30. Kavre, M.; Gadekar, A.; Gadhade, Y. Internet of Things (IoT): A survey. In Proceedings of the 2019 IEEE Pune Section International Conference (PuneCon), Pune, India, 18–20 December 2019; pp. 1–6. [Google Scholar]
  31. Benites, A.J.; Simões, A.F. Assessing the urban sustainable development strategy: An application of a smart city services sustainability taxonomy. Ecol. Indic. 2021, 127, 107734. [Google Scholar] [CrossRef]
  32. Dahiya, S.; Chowdhury, R.; Tao, W.; Kumar, P. Biomass and lipid productivity by two algal strains of chlorella sorokiniana grown in hydrolysate of water hyacinth. Energies 2021, 14, 1411. [Google Scholar] [CrossRef]
  33. Yigitcanlar, T.; Desouza, K.C.; Butler, L.; Roozkhosh, F. Contributions and risks of artificial intelligence (AI) in building smarter cities: Insights from a systematic review of the literature. Energies 2020, 13, 1473. [Google Scholar] [CrossRef]
  34. Hollands, R.G. Will the real smart city please stand up? Intelligent, progressive, or entrepreneurial? In The Routledge Companion to Smart Cities; Routledge: Oxfordshire, UK, 2020; pp. 179–199. [Google Scholar]
  35. Tang, C.; Xia, S.; Liu, C.; Wei, X.; Bao, Y.; Chen, W. Fog-enabled smart campus: Architecture and challenges. In Proceedings of the Security and Privacy in New Computing Environments: Second EAI International Conference, SPNCE 2019, Tianjin, China, 13–14 April 2019; pp. 605–614. [Google Scholar]
  36. Baucas, M.J.; Spachos, P. Using cloud and fog computing for large scale IoT-based urban sound classification. Simul. Model. Pract. Theory 2020, 101, 102013. [Google Scholar] [CrossRef]
  37. Tokozume, Y.; Harada, T. Learning environmental sounds with end-to-end convolutional neural network. In Proceedings of the 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), New Orleans, LA, USA, 5–9 March 2017. [Google Scholar]
  38. Hershey, S.; Chaudhuri, S.; Ellis, D.P.; Gemmeke, J.F.; Jansen, A.; Moore, R.C.; Plakal, M.; Platt, D.; Saurous, R.A.; Seybold, B.; et al. CNN architectures for large-scale audio classification. In Proceedings of the 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), New Orleans, LA, USA, 5–9 March 2017; pp. 131–135. [Google Scholar]
  39. Boddapati, V.; Petef, A.; Rasmusson, J.; Lundberg, L. Classifying environmental sounds using image recognition networks. Procedia Comput. Sci. 2017, 112, 2048–2056. [Google Scholar] [CrossRef]
  40. Costa, Y.M.; Oliveira, L.S.; Silla, C.N., Jr. An evaluation of convolutional neural networks for music classification using spectrograms. Appl. Soft Comput. 2017, 52, 28–38. [Google Scholar] [CrossRef]
  41. Ateya, A.A.; Soliman, N.F.; Alkanhel, R.; Alhussan, A.A.; Muthanna, A.; Koucheryavy, A. Lightweight deep learning-based model for traffic prediction in fog-enabled dense deployed iot networks. J. Electr. Eng. Technol. 2023, 18, 2275–2285. [Google Scholar] [CrossRef]
  42. Peng, B.; Abdulla, W.H.; Kevin, I.; Wang, K. Urban Noise Monitoring using Edge Computing with CNN-LSTM on Jetson Nano. In Proceedings of the Asia Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), Taipei, Taiwan, 31 October–3 November 2023; pp. 2244–2250. [Google Scholar]
  43. Giannakopoulos, T.; Pikrakis, A. Introduction to Audio Analysis; Academic Press: Cambridge, MA, USA, 2014. [Google Scholar]
  44. Su, Y.; Zhang, K.; Wang, J.; Madani, K. Environment sound classification using a two-stream CNN based on decision-level fusion. Sensors 2019, 19, 1733. [Google Scholar] [CrossRef]
  45. Zhang, Z.; Xu, S.; Cao, S.; Zhang, S. Deep convolutional neural network with mixup for environmental sound classification. In Pattern Recognition and Computer Vision: First Chinese Conference, PRCV 2018, Guangzhou, China, 23–26 November 2018; Proceedings, Part II; Springer Nature: Cham, Switzerland, 2018. [Google Scholar]
  46. Su, Y.; Vosoughi, A.; Deng, S.; Tian, Y.; Xu, C. Separating Invisible Sounds Toward Universal Audiovisual Scene-Aware Sound Separation. arXiv 2023, arXiv:2310.11713. [Google Scholar]
  47. Zhang, H.; Guan, J.; Zhu, Q.; Xiao, F.; Liu, Y. Anomalous sound detection using self-attention-based frequency pattern analysis of machine sounds. arXiv 2023, arXiv:2308.14063. [Google Scholar]
  48. Abdoli, S.; Cardinal, P.; Koerich, A.L. End-to-end environmental sound classification using a 1D convolutional neural network. Expert Syst. Appl. 2019, 136, 252–263. [Google Scholar] [CrossRef]
  49. Rahman, A.A.; Angel Arul Jothi, J. Classification of urbansound8k: A study using convolutional neural network and multiple data augmentation techniques. In Proceedings of the Soft Computing and its Engineering Applications: Second International Conference, icSoftComp 2020, Changa, Anand, India, 11–12 December 2020; Proceedings 2. Springer: Singapore, 2021; pp. 52–64. [Google Scholar]
  50. Yildirim, M. Automatic classification of environmental sounds with MFCC method and proposed deep model. Fırat Univ. J. Eng. Sci. 2022, 34, 449–457. [Google Scholar]
  51. Lezhenin, I.; Bogach, N.; Pyshkin, E. Urban sound classification using long short-term memory neural network. In Proceedings of the 2019 Federated Conference on Computer Science and Information Systems (FedCSIS), Leipzig, Germany, 1–4 September 2019; pp. 57–60. [Google Scholar]
  52. Barua, S.; Akter, T.; Musa, M.A.S.; Azim, M.A.A. Deep Learning Approach for Urban Sound Classification. Int. J. Comput. Appl. 2023, 975, 8887. [Google Scholar] [CrossRef]
  53. Mohaimenuzzaman, M.; Bergmeir, C.; West, I.; Meyer, B. Environmental Sound Classification on the Edge: A Pipeline for Deep Acoustic Networks on Extremely Resource-Constrained Devices. Pattern Recognit. 2023, 133, 109025. [Google Scholar] [CrossRef]
  54. Deperlioglu, O.; Kose, U.; Gupta, D.; Khanna, A.; Sangaiah, A.K. Diagnosis of heart diseases by a secure internet of health things system based on autoencoder deep neural network. Comput. Commun. 2020, 162, 31–50. [Google Scholar] [CrossRef]
  55. Palanisamy, K.; Singhania, D.; Yao, A. Rethinking CNN models for audio classification. arXiv 2020, arXiv:2007.11154. [Google Scholar]
Figure 1. Internet of Things architecture consisting of four main layers (adapted from [29]).
Figure 2. Smart city fog computing architecture (adapted from [34]).
Figure 3. Smart city sound detection architecture (adapted from [35]).
Figure 4. Confusion matrix for CNN layers.
Figure 5. Confusion matrix for LSTM layers.
Figure 6. Confusion matrix for dense layers.
Table 1. Classification performance of CNN model on UrbanSound8K dataset.

| Class | Precision | Recall | F1-Score | Support |
|---|---|---|---|---|
| Air_conditioner | 0.93 | 0.93 | 0.93 | 195 |
| Car_horn | 0.95 | 0.95 | 0.95 | 91 |
| Children_playing | 0.81 | 0.84 | 0.83 | 205 |
| Dog_bark | 0.84 | 0.82 | 0.83 | 182 |
| Drilling | 0.86 | 0.92 | 0.89 | 202 |
| Engine_idling | 0.93 | 0.94 | 0.94 | 216 |
| Gun_shot | 0.97 | 0.85 | 0.91 | 87 |
| Jackhammer | 0.95 | 0.92 | 0.93 | 187 |
| Siren | 0.97 | 0.93 | 0.95 | 199 |
| Street_music | 0.84 | 0.86 | 0.85 | 183 |
| Accuracy | 0.90 | 0.90 | 0.90 | 1747 |
| Macro avg | 0.90 | 0.90 | 0.90 | 1747 |
| Weighted avg | 0.90 | 0.90 | 0.90 | 1747 |
Table 2. Classification performance of LSTM model on UrbanSound8K dataset.

| Class | Precision | Recall | F1-Score | Support |
|---|---|---|---|---|
| Air_conditioner | 0.89 | 0.90 | 0.89 | 195 |
| Car_horn | 0.88 | 0.80 | 0.84 | 91 |
| Children_playing | 0.66 | 0.74 | 0.70 | 205 |
| Dog_bark | 0.71 | 0.65 | 0.68 | 182 |
| Drilling | 0.87 | 0.83 | 0.85 | 202 |
| Engine_idling | 0.94 | 0.88 | 0.90 | 216 |
| Gun_shot | 0.83 | 0.75 | 0.79 | 87 |
| Jackhammer | 0.87 | 0.88 | 0.88 | 187 |
| Siren | 0.85 | 0.84 | 0.85 | 199 |
| Street_music | 0.70 | 0.81 | 0.75 | 183 |
| Accuracy | 0.81 | 0.81 | 0.81 | 1747 |
| Macro avg | 0.82 | 0.81 | 0.81 | 1747 |
| Weighted avg | 0.82 | 0.81 | 0.81 | 1747 |
Table 3. Classification performance of Dense model on UrbanSound8K dataset.

| Class | Precision | Recall | F1-Score | Support |
|---|---|---|---|---|
| Air_conditioner | 0.75 | 0.97 | 0.84 | 195 |
| Car_horn | 0.97 | 0.86 | 0.91 | 91 |
| Children_playing | 0.69 | 0.68 | 0.68 | 205 |
| Dog_bark | 0.76 | 0.77 | 0.76 | 182 |
| Drilling | 0.93 | 0.86 | 0.89 | 202 |
| Engine_idling | 0.91 | 0.94 | 0.93 | 216 |
| Gun_shot | 0.96 | 0.63 | 0.76 | 87 |
| Jackhammer | 0.89 | 0.93 | 0.91 | 187 |
| Siren | 0.96 | 0.90 | 0.93 | 199 |
| Street_music | 0.78 | 0.75 | 0.77 | 183 |
| Accuracy | 0.84 | 0.84 | 0.84 | 1747 |
| Macro avg | 0.86 | 0.83 | 0.84 | 1747 |
| Weighted avg | 0.85 | 0.84 | 0.84 | 1747 |
Table 4. Performance comparison of CNN, LSTM, and Dense models on different sound categories.

| Model | Metric | Air Conditioner | Car Horn | Children Playing | Dog Bark | Drilling | Engine Idling | Gunshot | Jackhammer | Siren | Street Music |
|---|---|---|---|---|---|---|---|---|---|---|---|
| CNN | Accuracy | 0.93 | 0.95 | 0.81 | 0.84 | 0.86 | 0.93 | 0.97 | 0.95 | 0.97 | 0.84 |
| CNN | Precision | 0.93 | 0.95 | 0.84 | 0.82 | 0.92 | 0.94 | 0.85 | 0.92 | 0.93 | 0.86 |
| CNN | Recall | 0.93 | 0.95 | 0.83 | 0.83 | 0.89 | 0.94 | 0.91 | 0.93 | 0.95 | 0.85 |
| CNN | F1-score | 0.89 | 0.88 | 0.66 | 0.71 | 0.87 | 0.94 | 0.83 | 0.87 | 0.85 | 0.70 |
| LSTM | Accuracy | 0.90 | 0.80 | 0.74 | 0.65 | 0.83 | 0.88 | 0.75 | 0.88 | 0.84 | 0.81 |
| LSTM | Precision | 0.89 | 0.84 | 0.70 | 0.68 | 0.85 | 0.90 | 0.79 | 0.88 | 0.85 | 0.75 |
| LSTM | Recall | 0.75 | 0.97 | 0.69 | 0.76 | 0.93 | 0.91 | 0.96 | 0.89 | 0.96 | 0.78 |
| LSTM | F1-score | 0.97 | 0.86 | 0.68 | 0.77 | 0.86 | 0.94 | 0.63 | 0.93 | 0.90 | 0.75 |
| Dense | Accuracy | 0.84 | 0.91 | 0.68 | 0.76 | 0.89 | 0.93 | 0.76 | 0.91 | 0.93 | 0.77 |
| Dense | Precision | 0.93 | 0.95 | 0.81 | 0.84 | 0.86 | 0.93 | 0.97 | 0.95 | 0.97 | 0.84 |
| Dense | Recall | 0.93 | 0.95 | 0.84 | 0.82 | 0.92 | 0.94 | 0.85 | 0.92 | 0.93 | 0.86 |
| Dense | F1-score | 0.93 | 0.95 | 0.83 | 0.83 | 0.89 | 0.94 | 0.91 | 0.93 | 0.95 | 0.85 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
