4.2. Dataset
4.2.1. Data Definition and Introduction
The dataset used in this study combines a portion of the publicly available D-fire dataset with images collected through web crawling. D-fire is a public dataset of real-world images for fire and smoke detection, divided into fire-only images, smoke-only images, images with both fire and smoke, and normal images. It was collected from real-world environments, is labeled with high quality, and had duplicate images removed to increase data reliability. It reflects a wide variety of situations, helping models work accurately in the real world, and its size supports reliable performance in complex scenes. The D-fire dataset is divided into four main categories. First, fire-only images are those in which the fire phenomenon is clearly visible. Second, smoke-only images contain smoke but no fire; smoke is an important early sign of fire and is used to increase the sensitivity of prevention and detection systems. Third, images with both fire and smoke reflect complex situations and help model real-world fire scenes. Finally, normal images contain neither fire nor smoke and are used to train the system to avoid raising false alarms.
After collecting images from the D-fire dataset, the collected images were thoroughly reviewed to remove as many duplicates as possible, and unnecessary images were deleted so that duplicates would not negatively impact model training. To obtain additional data for later experiments, a web crawl was performed for fire, smoke, and normal images. The crawl automatically collected images from various sources based on specific keywords, and the crawled images were then categorized into fire, smoke, and normal conditions. The D-fire dataset and the crawled data were merged to form the final experimental dataset after deleting images that were not real, for example, illustrations of fire or Photoshopped images.
Figure 8 shows a graph of the number of images per label: 6144 normal images, 1216 smoke images, and 2549 fire images, for a total of 9909 images. In general, when collecting image data in a distributed environment, fire and smoke occur less frequently than normal situations; we took this into account when collecting the data.
Figure 9 shows sample images from the dataset according to their labels.
This experiment was designed based on prior research showing that federated learning on small datasets can work well with fewer than 10 clients [44]. While federated learning typically assumes an environment with many clients, prior work has shown that sufficient learning results can be achieved with fewer clients without significant performance degradation. For this reason, we limited the number of clients to 10 or fewer to evaluate the performance and efficiency of federated learning.
In this experiment, we configured one server with 3, 5, or 10 clients to compare the impact of the number of clients on federated learning.
Table 2 shows the distribution of image data used by each client and the server, by label. The server and every client each used 3000 images, but each client was assigned a different percentage of data per label to reproduce a non-independent and identically distributed (non-IID) environment. By non-IID, we mean that the clients do not share the same data distribution and each holds data biased toward certain classes. In this experiment, we skewed the distribution by varying the ratio between labels on each client to reproduce the data imbalance that may occur in real-world environments.
Specifically, the server and Clients 1, 5, and 6 have a balanced distribution, with fire, smoke, and normal data in equal proportions, 1:1:1. Clients 2, 7, and 8, on the other hand, are normal-heavy, with a 1:1:4 ratio of fire, smoke, and normal data. Clients 3 and 10 have a 4:1:1 ratio of fire, smoke, and normal data, reflecting the dominance of fire data. Finally, Clients 4 and 9 are dominated by smoke data, so their fire, smoke, and normal data were set to a 1:4:1 ratio.
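The per-client label ratios above can be reproduced with a simple sampler. The 3000-image budget and the fire/smoke/normal labels follow the text; the data layout (a list of (image, label) pairs) and the function name are our assumptions for illustration:

```python
import random
from collections import defaultdict

def partition_by_ratio(samples, ratios, n_per_client=3000, seed=0):
    """Draw a client's local set from (image, label) pairs according to a
    fire:smoke:normal ratio, e.g. {"fire": 1, "smoke": 1, "normal": 4}.
    Layout of `samples` is an assumption for illustration."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for img, lbl in samples:
        by_label[lbl].append((img, lbl))
    total = sum(ratios.values())
    subset = []
    for lbl, r in ratios.items():
        k = round(n_per_client * r / total)  # label's share of the budget
        subset.extend(rng.sample(by_label[lbl], k))
    rng.shuffle(subset)
    return subset
```

With a 1:1:4 ratio this yields 500 fire, 500 smoke, and 2000 normal images per client, matching the normal-heavy setup of Clients 2, 7, and 8.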
4.2.2. Hyperparameters
For initial ViT training using the server-side data, we used the pretrained Google vit-base-patch16-224 model through the ViTForImageClassification and ViTImageProcessor classes (Google, Mountain View, CA, USA). The number of epochs was 50, the learning rate was 0.0001, and the Adam optimizer with a cross-entropy loss criterion was adopted. For the FL structure using the .pth file obtained from the initial ViT training, the number of experimental epochs was 10, the learning rate was 0.0001, and the Adam optimizer was used. The step size for moving the center point was 1e6; whenever the clients' average silhouette score exceeded 0.8, the step size was reduced by multiplying it by 0.9 per round. When passing the global center point from the server to the clients, the ratio was set to step/2 to reflect each client's style vector. The distance threshold, used when computing distances between elements in the same cluster and summing only those below it, was 30.
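The step-size schedule described here can be sketched as follows, assuming the server collects the clients' silhouette scores each round (function names are ours, not from the original code):

```python
def update_step(step, silhouette_scores, threshold=0.8, decay=0.9):
    """Per-round schedule for the center-point step size: once the clients'
    average silhouette score exceeds the threshold, shrink the step by
    multiplying it by 0.9, as described in the text."""
    if sum(silhouette_scores) / len(silhouette_scores) > threshold:
        step *= decay
    return step

def client_step(step):
    """Step passed from server to client (step/2), so that each client's
    own style vector is partially reflected."""
    return step / 2
```

Starting from the initial value of 1e6, the step therefore stays fixed until clustering quality is high enough, then decays geometrically.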
4.4. Results
For the three-client setting, Table 3 lists the NMI values for each client per round, Table 4 lists the silhouette scores for each client per round, and Table 5 lists the accuracy values for the server and each client per round.
For the five-client setting, Table 6 lists the NMI values for each client per round, Table 7 lists the silhouette scores for each client per round, and Table 8 lists the accuracy values for the server and each client per round.
For the 10-client setting, Table 9 lists the NMI values for each client per round, Table 10 lists the silhouette scores for each client per round, and Table 11 lists the accuracy values for the server and each client per round.
First, based on NMI, we see that with three clients, the clustering quality is stable, with NMI scores mostly staying above 0.80. This means that the local models generated by each client reflect the overall data well. However, with five clients, the NMI fluctuates between 0.70 and 0.85, and the variation in clustering quality between clients increases. This suggests that as the data are distributed across more clients, the quality of clustering may decrease on some clients. With 10 clients, the NMI score varies widely from the low 0.60s to 0.80, suggesting that as the number of clients increases, the data distribution becomes more unbalanced and clustering performance is likely to deteriorate. In particular, some clients may have very low NMI scores.
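The NMI scores discussed above can be computed with a small, dependency-free sketch. We use geometric-mean normalization here; libraries differ in this choice (scikit-learn's `normalized_mutual_info_score`, for example, defaults to the arithmetic mean), so treat this as an illustration rather than the authors' exact code:

```python
from collections import Counter
from math import log

def nmi(labels_true, labels_pred):
    """Normalized mutual information, NMI = I(U;V) / sqrt(H(U) * H(V)).
    Label values are arbitrary, so a relabeled-but-identical clustering
    still scores 1.0."""
    n = len(labels_true)
    joint = Counter(zip(labels_true, labels_pred))
    cu = Counter(labels_true)
    cv = Counter(labels_pred)
    # mutual information from the joint and marginal label counts
    mi = sum(c / n * log(n * c / (cu[u] * cv[v])) for (u, v), c in joint.items())
    hu = -sum(c / n * log(c / n) for c in cu.values())  # entropy of true labels
    hv = -sum(c / n * log(c / n) for c in cv.values())  # entropy of predictions
    if hu == 0.0 or hv == 0.0:
        return 0.0
    return mi / (hu * hv) ** 0.5
```

An NMI near 1.0 means a client's cluster assignments align with the true fire/smoke/normal labels up to renaming; independent assignments score near 0.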
We also see differences in terms of silhouette score. With three clients, the silhouette score consistently shows values close to 0.8, indicating high data aggregation within clusters and clear boundaries between clusters. With five clients, the silhouette score varies between 0.7 and 0.85, with more clients with unclear cluster boundaries than with three clients. With 10 clients, the silhouette score falls below 0.7 on average, indicating that the boundaries between clusters are more blurred, and clustering efficiency tends to decrease as the number of clients increases.
Looking at accuracy, with three clients, the accuracy ranges between 93% and 97%, indicating that with few clients, each client covers a sufficient amount of data and learning performance is stable. With five clients, the accuracy varies from 88% to 97%, with some clients performing below 90%, likely due to data imbalance or the effects of spreading the data across more clients. With 10 clients, the accuracy variance is wider, ranging from 85% to 95%, reflecting differences in the size and quality of the data each client handles.
In terms of convergence speed and learning stability, convergence is relatively fast with three clients. When the converged models from each client are merged, performance does not degrade significantly and optimization is stable. With five clients, convergence can be somewhat slower, and the difference in model performance between clients can be larger. With 10 clients, convergence is even slower and the gap between client models grows, making learning less stable; synchronization becomes more difficult and the optimization process more complex. In conclusion, the three-client setting provides stable results across performance metrics such as NMI, silhouette score, and accuracy, and it is also very favorable in terms of convergence speed and communication efficiency. As the number of clients increases to 5 and 10, performance variance tends to increase, clustering quality decreases, and communication and computational resource consumption grows. Therefore, three clients is the optimal setting for balancing resource efficiency and performance.
Figure 12 graphs the NMI values of each client over the rounds, and Figure 13 graphs the silhouette score values of each client over the rounds. Figure 14 graphs the accuracy values of the server and each client over the rounds. The NMI decreases for all clients but remains at around 0.6. The silhouette score increases over the rounds for all clients, reaching 0.9. For accuracy, the server has the highest accuracy, and it changes little over the rounds.
In this study, the confusion matrix was used for visualization.
Figure 15 shows the confusion matrices of the server and clients. From the figure, we can see that the proposed model classifies most cases correctly.
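The per-class counts visualized in Figure 15 can be tallied with a minimal helper; the label names follow the text, and the row/column convention (rows = true labels, columns = predictions) is our choice for illustration:

```python
def confusion_matrix(y_true, y_pred, labels=("fire", "smoke", "normal")):
    """Rows are true labels, columns are predicted labels, both ordered
    as in `labels`. Correct classifications land on the diagonal."""
    idx = {lbl: i for i, lbl in enumerate(labels)}
    m = [[0] * len(labels) for _ in labels]
    for t, p in zip(y_true, y_pred):
        m[idx[t]][idx[p]] += 1
    return m
```

For example, a fire image predicted as smoke increments the off-diagonal entry in the fire row and smoke column, which is exactly the kind of normal/smoke confusion discussed below.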
From the application of the bisecting K-means algorithm on the server, we see that the algorithm correctly classifies the labels of all but 1 of the 3000 images, achieving high classification accuracy. For the clients, the model maintains high accuracy in both non-IID and independently and identically distributed (IID) environments. Near-perfect predictions were obtained for the fire class on all clients. Some confusion was observed between the normal and smoke classes, but it did not significantly affect overall performance. The results for Clients 2 and 3 reflect the non-IID data distribution, with confusion between the normal and smoke classes, but the fire class is predicted very accurately, showing that the model is not biased toward any class and performs well despite the class imbalance. In conclusion, the model maintains high overall performance even with an unbalanced, non-IID data distribution.
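The bisecting K-means procedure used on the server can be sketched as follows: start from a single cluster and repeatedly split the largest cluster with plain 2-means until the target number of clusters is reached. This is a generic, dependency-free illustration of the algorithm, not the authors' implementation (which operates on ViT feature vectors):

```python
import random

def two_means(points, seed=0, iters=20):
    """Plain 2-means on a list of numeric tuples; returns two point lists."""
    rng = random.Random(seed)
    c0, c1 = rng.sample(points, 2)  # two distinct data points as initial centers
    g0, g1 = [], []
    for _ in range(iters):
        g0, g1 = [], []
        for p in points:
            d0 = sum((a - b) ** 2 for a, b in zip(p, c0))
            d1 = sum((a - b) ** 2 for a, b in zip(p, c1))
            (g0 if d0 <= d1 else g1).append(p)
        if not g0 or not g1:
            break
        c0 = [sum(xs) / len(g0) for xs in zip(*g0)]  # recompute centroids
        c1 = [sum(xs) / len(g1) for xs in zip(*g1)]
    return g0, g1

def bisecting_kmeans(points, k, seed=0):
    """Split the largest cluster with 2-means until k clusters exist."""
    clusters = [list(points)]
    while len(clusters) < k:
        clusters.sort(key=len, reverse=True)
        g0, g1 = two_means(clusters.pop(0), seed)
        clusters += [g0, g1]
    return clusters
```

Because each split only bisects one cluster, the method builds a hierarchy of clusters and is less sensitive to initialization than running flat K-means with k centers at once.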
In this study, high-dimensional data such as images were visualized in two dimensions using the PCA technique.
Figure 16 shows the PCA visualization of the data distribution of each class in a 2D space. In the figure, squares represent fire; triangles, smoke; and circles, normal; the predicted classes are drawn in the same color as the corresponding actual symbols. The figure shows that the proposed model separates the classes well and provides a clear view of the data structure and clustering performance.
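The 2D projection behind Figure 16 can be sketched with standard PCA: center the features, then project onto the top two eigenvectors of the covariance matrix. This is a generic illustration assuming NumPy; in the paper the input features come from the ViT backbone:

```python
import numpy as np

def pca_2d(features):
    """Project high-dimensional feature vectors to 2D with PCA."""
    X = np.asarray(features, dtype=float)
    X = X - X.mean(axis=0)                 # center each feature dimension
    cov = np.cov(X, rowvar=False)          # feature covariance matrix
    vals, vecs = np.linalg.eigh(cov)       # eigenvalues in ascending order
    top2 = vecs[:, np.argsort(vals)[::-1][:2]]  # top-2 principal directions
    return X @ top2
```

The first output column carries the largest variance, so well-separated classes appear as distinct point groups when the two columns are used as plot coordinates.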
To compare the performance of a CNN and ViT, we ran experiments in the same setup as the ViT experiments, using ResNet18 as the CNN backbone. Table 12 presents the NMI values of each client per round, which allows us to analyze the difference in clustering quality. Table 13 compares the silhouette scores of each client per round to evaluate intracluster cohesion and intercluster separation. Finally,
Table 14 shows the overall performance change through the server accuracy values per round. According to the CNN-based results, the server's clustering accuracy was relatively good in the early rounds at 0.587 but gradually decreased as the rounds progressed. The NMI value of each client was generally low, falling continuously from 0.334 to 0.087, which indicates poor clustering quality on the clients. Silhouette scores were also low, averaging around 0.3 per client, indicating that the boundaries between clusters were not clear and the data within clusters were not tightly grouped.
In contrast, the ViT-based experiments showed much more stable and superior performance than the CNN. The ViT model achieved high clustering accuracy from the early rounds and showed little degradation as the rounds progressed. In the final round, it maintained a high clustering accuracy of 0.9787, and the clients' NMI values decreased slightly from 0.859 to 0.744 but remained high. Silhouette scores were above 0.8 for all clients, with well-defined boundaries between clusters and tight grouping of data within them. ViT outperformed the CNN not only in clustering accuracy but also in NMI and silhouette score, and it showed stable clustering performance even in the presence of data imbalance.
The proposed model achieves 98% classification accuracy for fire, smoke, and normal classes in an FL structure in a non-IID environment. The images used in the experiments are real images, improving the model’s applicability for industrial sites. This model can make a significant contribution in the event of a fire. The proposed model provides reliable data and highly accurate fire classification and is expected to play an important role in building a system for the early diagnosis of fire in industrial sites.