1. Introduction
Human Activity Recognition (HAR) is an active research area, with applications ranging from healthcare analysis [1] to the automation of intelligent home systems [2]. Despite its potential, automated HAR still faces challenges such as overfitting, which arises mainly from limited data and the inability of traditional pipelines to capture the finer details of human activity through sensors. This can lead to situations where the HAR system does not correctly identify human actions, degrading its performance in critical areas like health monitoring or home care. Solving these issues is important not only to improve the accuracy of HAR but also to unlock its full potential across application areas. In cyber-physical systems, accurate HAR is critical for ensuring smooth interaction between physical components and computational elements. As these systems grow in complexity, HAR models must reliably process and interpret sensor data to maintain safety, efficiency, and user satisfaction in real-time environments. In healthcare, for example, accurately identifying activities can enable early intervention, potentially saving lives. Similarly, accurate HAR in smart homes can improve energy efficiency and enhance the user experience by automating tasks based on recognized activities. The importance of solving these problems can be seen from both scientific and practical perspectives. However, overfitting remains a common issue in HAR, and models trained with limited data generally perform poorly in new scenarios [3]. Additionally, reliance on fixed features often hinders the adaptability of HAR systems to varied real-world situations. Researchers are exploring new methods, such as discrete representations [4], which could improve HAR accuracy, especially in critical areas like healthcare. Accurate HAR is vital in clinical practice because it enables early detection, improves patient outcomes, and reduces medical costs. Advances in HAR use sensor data to compare state-of-the-art models and highlight current challenges and future research directions in the field.
Moreover, a recent study [5] showed the potential of combining multiple datasets to improve HAR performance, demonstrating that fusing sensor data with video input can produce more informative models. Another study, by Martinez-Rios and Alvarez [6], highlighted the benefits of adaptive training in HAR, which protects pre-trained models from degradation and thereby mitigates the problem of limited data. In addition, Han et al. [7] explained the benefits of employing tracking methods in a HAR model to capture physical characteristics and enhance recognition accuracy.
Furthermore, with the advent of smart devices, rapid progress in HAR has been observed [8]. Examples include fitness monitors [1] and smartwatches [9], which have embedded accelerometers and gyroscopes that continuously monitor fine-grained motions. This uninterrupted stream of data enables accurate, real-time tracking and recognition of different activities. As an illustration, an experiment showed that wearable sensors can detect subtle abnormalities in human gait, which is essential for early identification of neurological disorders [10]. Recently, Shao et al. [3] introduced a novel approach called ConvBoost that enhances the capacity of Convolutional Neural Networks (CNNs) to recognize activities in sensor data by augmenting the data to address overfitting, improving model accuracy through enhanced training methods.
Das et al. [2] argued that one of the emerging focuses in HAR is the interpretability of artificial intelligence, especially for smart homes. Their argument is particularly important given the need for clear decision-making processes within any system that employs AI in the management of patients’ health status, as well as in home automation systems. By making AI more understandable, these systems can earn people’s trust and provide insights that are easier to act on.
In recent years, many techniques, such as data generation [3] and deep learning [11], have been used to address issues in HAR. However, these techniques still struggle with overfitting and with continuous data streams in which the correct activity fails to be predicted. For example, deep learning models require large and varied datasets, while data augmentation methods may fail to capture the full range of human activities.
In this paper, we propose a novel approach called Discrete Human Activity Recognition (DiscHAR) that addresses these challenges by combining advanced techniques. We use a sampling process called R-Frame [3] and a data augmentation method known as Mix-up [3,12] to create rich and varied data. We then apply K-means vector quantization to turn these data into discrete categories, with the aim of improving the model’s accuracy and interpretability.
The limited availability of sensor data and the complexity of human activities pose serious challenges to developing reliable and accurate HAR models, particularly regarding overfitting and feature representation. However, these challenges also bring opportunities for innovation: by using techniques such as semi-supervised learning and higher-order data representations like vector quantization, we can bridge data gaps and improve model interpretability.
For example, Haresamudram et al. [13] examined the development of contrastive predictive coding and demonstrated the role of self-supervision in learning representations from unlabeled data. This approach can reduce dependence on large labeled datasets and strengthen the learning of HAR systems. Similarly, Swain et al. [14] explored the use of WiFi network logs to characterize student interactions and their relationship to learning, showing the diverse applications of HAR.
Our approach reduces overfitting and increases data diversity by employing data-centric methodologies, including an augmentation layer (Mix-up) and a sampling scheme (R-Frame), to address HAR challenges. We then convert continuous features into discrete representations using K-means vector quantization, which enhances HAR accuracy and resolves enduring issues in the area. By providing an accurate and useful analysis, we aim to demonstrate the efficacy of our proposed method and further the development of HAR technology [15].
The integration of HAR with the Internet of Things (IoT) is also a promising area for further research. Relevant information can flow easily from IoT devices, improving HAR systems and producing more intelligent and adaptive settings. For example, as shown in [16,17], IoT-enabled HAR systems can dynamically change home automation settings based on real-time activity recognition. This integration highlights the opportunity for advances in HAR techniques that can deliver more tailored and responsive smart home experiences.
The following are our research contributions achieved in this work:
Data generation: We present a novel method that uses a sampling window to obtain the sample size and then generates data using the Mix-up technique, addressing the data limitation in human activity recognition.
Discrete representation: The generated continuous data are converted to a discrete form using K-means vector quantization to improve accuracy.
Reduced overfitting: Discrete data can help in reducing the risk of overfitting, as the model is less likely to fit to noise and minor fluctuations in the data.
Accuracy: In capturing similar movements, our proposed system outperformed existing state-of-the-art methods in terms of accuracy, achieving strong results on the OPP79, PAMAP2, and WISDM datasets.
In summary, by resolving common problems like overfitting and feature representation in HAR, this technology can be applied in many kinds of applications. Our approach uses data generation and representation techniques and aims to improve the accuracy and usability of HAR for applications such as healthcare and smart homes, paving the way for new applications in these fields. Through continuous research and analysis, we strive to improve HAR and its impact on quality of life and well-being. By combining new advances with existing methods, as proposed in [2,3], we foresee a future in which HAR systems are more accurate, interpretable, and versatile, advancing smart environments and personalized healthcare.
In the following sections, we first provide a comprehensive literature review (Section 2), where we examine existing research and developments in human activity recognition (HAR) and the application of deep learning techniques. This sets the stage for our proposed methodology (Section 3), which includes data generation, feature extraction, and classification specifically designed for HAR. We then detail our experimental setup (Section 5), including dataset selection and preprocessing steps. The architecture of our convolutional neural network (CNN) model (Section 5.2) is outlined next, describing how it was structured and optimized for HAR tasks. Finally, we present our experimental results in Section 5.3, analyzing the performance and effectiveness of our approach and discussing its contribution to the advancement of HAR technologies.
2. Literature Review
In recent years, various methods have been explored to improve human activity recognition (HAR) systems, particularly in handling challenges like overfitting and underfitting, data limitations, and interpretability issues. We categorize the key literature into several themes, including machine learning HAR methods, deep learning approaches, AI-based approaches, and computer vision techniques.
Previous research on human activity recognition (HAR) has explored various ways to deal with the problems of over- and underfitting. Traditional approaches have focused on machine learning techniques, such as Support Vector Machines (SVMs) and random forests, that classify activities based on hand-crafted features [18]. However, these methods often struggle with the complexity and variability of human activities, motivating more advanced approaches.
Previously, it was common for human activity recognition (HAR) to rely heavily on classical machine learning algorithms. These techniques required experts to manually extract features from sensor data and train classifiers on them, such as Support Vector Machines (SVMs) and k-Nearest Neighbors (k-NN), to classify various human behaviors [19]. Although these methods work well, they also have disadvantages. They often have difficulty recognizing new or less distinct activities and adapting to changes in sensor placement or user behavior. Additionally, because features are hand-crafted, these models sometimes fail to capture varying activity patterns. Despite these limitations, early work such as the two-level taxonomy model provides an important reference point for the development of more complex HAR pipelines and helps us better understand variation across individuals.
In [14], a combination of gradient boosting and linear regression models was used to analyze data collected from 163 students in 54 program groups. This dataset includes surveys and WiFi-based sensing logs. Three models are presented: MPeer, trained on peer evaluation scores from self-reported survey responses; MIndi, focusing on individual features; and MColloc, dedicated to features representing collocation among group members. Despite achieving high precision, the models exhibit limited sensitivity due to the challenge of accurately determining individuals’ locations. The research identified a gap in WiFi-based sensing for precise location prediction, motivating further investigation to enhance model performance in this context.
Examining HAR, [20] delved into understanding and categorizing human actions using tools like smartphones and AI, underscoring HAR’s growing significance across diverse applications. The review spanned from 2011 to 2021, focusing on devices, AI techniques, and real-world applications. Despite HAR’s notable impact in healthcare, AI’s potential remains in its early stages, warranting more dependable and unbiased models. Additionally, the paper highlights the scarcity of research on abnormal activity detection and human action forecasting.
Given the complexity of modeling the data generation process in such environments, studies of human activity recognition (HAR) in data-rich, resource-intensive settings are important. One study presented an approach that emphasizes understanding the interactions between different components and their meanings, considering a wide range of designs that satisfy data collection requirements through neural architecture search (NAS). The Sussex-Huawei Locomotion dataset, which collects sensor data of multiple types, including accelerometer, gyroscope, cellular network, WiFi network, and audio data, was used to assess the performance of the proposed method. The study demonstrated the effectiveness of the design by constraining the learning process of the neural network. These findings highlight the potential to create robust and effective learning pipelines, especially for data sampling, paving the way for improved activity recognition in sensor-rich environments.
In recent years, deep learning has emerged as an important component of HAR, providing the ability to learn representations directly from raw data. Convolutional neural networks (CNNs) have become popular due to their strong performance in learning spatial and temporal patterns from sensor data, leading to better recognition [21]. The survey in [22] reviews machine learning and deep learning developments within the scope of HAR and concludes by examining various ways in which CNNs can be used to address overfitting and feature representation issues in HAR.
The application of deep learning allows for the automatic extraction of hierarchical features straight from raw sensor data, which has advanced the area of HAR in recent years. In particular, for capturing spatial and temporal patterns in time-series data, CNNs have matured to the point where they are well suited to such recognition tasks. CNN architectures for HAR operate directly on raw sensor data or on spectrograms obtained from the sensor readings [23]. CNNs that learn hierarchical representations perform better than classical machine learning models, especially in settings with large and diverse data.
For applications like healthcare, the difficulty of using deep learning to analyze human activities is an important concern [3]. While deep learning shows promising performance, it often suffers from the limited availability of training data, which can compromise performance. ConvBoost was proposed as a solution to limited training data: it generates additional training information along multiple dimensions to improve the performance and capability of ConvNet-based HAR models. Experiments showed that ConvBoost outperforms plain ConvNets in terms of F1 score on many datasets. However, some difficulties remain, especially in differentiating similar activities. This research adds to the body of HAR work, with potential implications for improving healthcare and behavior analysis [3].
To improve the ability of wearable devices to recognize human activities, Haresamudram et al. [4] explored the use of discrete representations to transform continuous data. This approach promises to expand the capabilities of these devices and improve their overall performance. It involves a two-step process: first, pre-training the model on public data using sophisticated methods and, second, fine-tuning it for specific targets. Evaluation across multiple datasets shows significant improvements over traditional methods. However, differentiating similar activities remains an open research question.
In work proposed by Shan et al. [24], two variants of LSTM, namely a delay model and a transition model, were introduced as new training strategies to address the challenge of recognizing sporadic, non-periodic activities against a background of irrelevant activities. The delay model incorporates predefined delay intervals to add contextual depth, enhancing the LSTM’s ability to detect and recognize these sporadic activities over time. The transition model, on the other hand, captures the subtle transitions between different activities, which helps in recognizing sporadic behaviors emerging from continuous actions. These models integrate both continuous data and event-related information, making them robust for the identification of critical patterns. By evaluating their approach on publicly available datasets, the authors showed promising results in detecting activities related to accident-prone scenarios, underscoring the practical utility of these advanced LSTM training methods in real-world applications.
The authors of [25] introduced a novel approach to enhance HAR using a combination of real and virtual data. Leveraging ChatGPT, a sophisticated text generator, the authors generated activity descriptions that were subsequently transformed into virtual data simulating human movements and interactions. This methodology significantly reduces the resource-intensive process of collecting real-world data, offering cost and time savings. Evaluations conducted on three widely used HAR datasets (RealWorld, PAMAP2, and USC-HAD) demonstrated notable improvements in HAR model performance: integrating virtual data enhances model accuracy compared with using real data alone, showcasing the potential to improve HAR systems efficiently.
Addressing the challenge of recognizing human activities from sensor data in scenarios with limited labeled training data, a study conducted by Plötz [5] focused on three key areas: representation learning, self-supervised methods, and cross-modality transfer. The goal was to enhance human activity recognition systems’ performance by making them smarter and more adaptable.
After exploring traditional machine learning techniques and advancements in deep learning, particularly CNNs, we discuss LLM-based approaches [26], because they offer innovative solutions for data augmentation and the handling of small datasets, which are common challenges in HAR. By leveraging LLMs such as ChatGPT, we can synthesize data to augment training sets, thereby addressing issues of data scarcity and overfitting. This integration not only enhances task performance but also preserves privacy by fine-tuning local models without exposing sensitive data, making it a practical approach for HAR applications.
Proposing enhancements to the Contrastive Predictive Coding (CPC) framework for self-supervised learning within sensor-based human activity recognition (HAR), the study discussed in [13] introduced advancements in encoder architecture, autoregressive networks, and future prediction tasks. Through extensive experimentation across diverse datasets and sensor positions, the research highlighted the efficacy of the enhanced CPC framework compared to its predecessor. The findings underscore the potential of this approach in practical HAR applications, particularly in scenarios where annotated data are scarce or difficult to obtain. Emphasizing the benefits of the CPC framework, the paper underscores its ability to learn meaningful representations from abundant unlabeled sensor data.
The Textless Translatotron model provides valuable insights into quantization and the representation of data in discrete forms, which is relevant to our work. As an end-to-end speech-to-speech translation (S2ST) model, Textless Translatotron operates without textual supervision by predicting discrete representations of target speech via a VQ-VAE quantizer, bypassing intermediate text and phoneme dependencies [27]. This approach not only demonstrates competitive translation quality on datasets like CVSS-C and Fisher Spanish–English but also shows how quantization can effectively transform continuous speech data into discrete units for high-quality language translation. Such techniques inform our own work, providing a framework for representing continuous data in discrete form using vector quantization.
In addition to traditional machine learning and deep learning techniques, we discuss vision-based approaches due to their ability to capture rich spatial information that is crucial for recognizing complex human activities. Vision-based methods [28], enhanced by techniques like Finite Discrete Tokens (FDT) for cross-modal alignment, address the granularity mismatch between visual and textual data, leading to improved accuracy and performance. Incorporating vision-based approaches in HAR provides a comprehensive understanding of human activities, leveraging detailed visual context that complements sensor data, thereby enhancing the overall effectiveness and robustness of HAR systems.
The LayoutDM model, designed for the crafting of structured layouts with specific constraints [29], employs a discrete diffusion approach to iteratively refine layouts while considering structured data. Notably, it incorporates constraints during inference for conditional generation. LayoutDM showcases superior performance across diverse layout tasks and datasets, outpacing alternative methods and demonstrating its effectiveness. Key advantages of the model include its ability to address shortcomings in existing approaches and accommodate variable-length elements. However, the paper acknowledges the potential misuse of automatically generated content. Nonetheless, LayoutDM emerges as a successful tool for producing controlled layouts with wide-ranging applications.
Learning-based approaches offer a number of advantages over hand-crafted alternatives. Still, some problems in human activity recognition (HAR) require more detailed study. HAR models tend to overfit as a result of the limited availability of training data. Moreover, the lack of representations beyond continuous features in this field inhibits the development of models that are more accurate and interpretable. Furthermore, there remains a research gap in effectively discriminating highly similar movements in HAR applications, which is crucial for tasks requiring precision in activity recognition.
In light of these gaps, the following research questions emerge:
What approaches can be employed to tackle the challenge of overfitting in human activity recognition (HAR)?
What potential benefits does the adoption of discrete sensor data representations offer in HAR?
What advanced analytical tools can be applied to analyze symbolic sequences derived from discretized sensor data in HAR?
How do discretized data simplify the analysis of complex movements in HAR?
These research questions guide our exploration of innovative methodologies to address the challenges in HAR, focusing on mitigating overfitting, enhancing data representations, and improving the accuracy of activity recognition systems.
In the following sections, we outline our proposed methodology (Section 3), experimental setup (Section 5), and CNN model architecture (Section 5.2) and present experimental results (Section 5.3). Our methodology encompasses data generation, feature extraction, and classification tailored for HAR. We detail dataset selection, preprocessing steps, and the model architecture. Subsequently, we present details of the experimental setup and the CNN model architecture and analyze results to contribute to HAR advancement.
3. Proposed Methodology
In our proposed methodology, we start by collecting sensor data on human activities like walking, sitting, and standing. We then enhance this dataset using sampling (R-Frame) and augmentation (Mix-up) techniques. Next, we convert these data into simpler, discrete representations using vector quantization via K-means. For the value of k in K-means, we use the elbow method to determine the optimal number of clusters for each class in the dataset. Then, we convert the discrete sequences into one-hot encoded vectors to pass them to the model. Finally, we pass these vectors to the CNN model, ensuring precise recognition of human activities.
Figure 1 illustrates the entire framework. Details of each step are discussed below.
3.1. Dataset
In our research, we use diverse sensor datasets from UCI repositories [30,31,32] to capture a spectrum of physical activities under controlled conditions. These datasets include information from multiple sensors, like accelerometers, gyroscopes, and magnetometers, that capture a wide range of activities performed by different people. Using these rich sensor data is essential for our study, as it helps us assess and improve how well algorithms can recognize human activities. Through our research, we aim to advance cognitive processing studies by enhancing the accuracy and reliability of these algorithms in different scenarios.
3.2. Data Preprocessing
During the data preprocessing phase, we take several steps to ensure the quality and consistency of the dataset. First, we remove null or missing entries to eliminate inconsistencies; null values can distort the analysis and interpretation of the data, so removing them helps produce a better model. We also identify and remove duplicate records to reduce noise and improve the efficiency of the dataset for analysis. By removing null values and duplicates, we improve the quality and reliability of the dataset, making it ready for further analysis and modeling.
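As a minimal sketch of this cleaning step (using pandas; column names and file paths are illustrative assumptions, not the paper's actual pipeline), the following snippet removes null entries and duplicate records:

```python
import pandas as pd

def clean_sensor_data(df: pd.DataFrame) -> pd.DataFrame:
    """Drop rows with missing sensor readings, then drop exact duplicates."""
    df = df.dropna()           # null values would distort later analysis
    df = df.drop_duplicates()  # duplicate records add noise without information
    return df.reset_index(drop=True)

# Example usage with a hypothetical raw recording:
# raw = pd.read_csv("wisdm_raw.csv")
# clean = clean_sensor_data(raw)
```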
To address sensor limitations and irregular data sampling in HAR systems, several techniques are effective in enhancing a HAR system’s robustness in smart home environments. For instance, data augmentation and sensor fusion can enhance sensor data by creating variations and combining information from multiple sources. In cases of irregular sampling, LSTM models and Temporal Convolutional Networks (TCNs) handle sequences with varying gaps, while imputation techniques fill in the missing data. Hybrid CNN–LSTM models capture both spatial and temporal patterns, improving accuracy for complex activities. Emerging models like transformers are also promising, as they manage sequential dependencies without requiring evenly spaced data.
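To illustrate one of these options, the sketch below shows a simple interpolation-based imputation for an irregularly sampled stream; the 20 ms target rate (50 Hz) and the use of a datetime-indexed DataFrame are our assumptions, not choices prescribed above:

```python
import pandas as pd

def impute_regular(df: pd.DataFrame, rate: str = "20ms") -> pd.DataFrame:
    """Resample an irregular, datetime-indexed sensor stream to a fixed
    rate and linearly interpolate the resulting gaps."""
    return df.resample(rate).mean().interpolate(method="linear")
```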
3.3. Feature Extraction
For human activity recognition using the WISDM [33], PAMAP2 [30], and OPPORTUNITY [34] datasets, we customize feature extraction according to the characteristics of each dataset. WISDM provides accelerometer data during activities such as walking and running. PAMAP2 includes accelerometer, gyroscope, and magnetometer readings to provide insight into daily activity. The OPPORTUNITY dataset enriches our feature set with various types of sensor data, including accelerometer, gyroscope, magnetometer, and ambient sensor readings. Using the unique properties of these data, we aim to create accurate models to recognize human activities in different situations.
Accelerometer readings represent the acceleration experienced by the subject. In human activity recognition, the magnitude and patterns of acceleration can indicate different activities. Walking, for example, creates rapid patterns with characteristic frequencies.
A gyroscope reading measures the rotational speed of an object. Rotational movements such as turns or changes in orientation can indicate specific activities. The overall rotational intensity can be summarized by the magnitude

$\omega_{\text{mag}} = \sqrt{\omega_x^2 + \omega_y^2 + \omega_z^2}$

where $\omega_x$, $\omega_y$, and $\omega_z$ are the angular velocities along the X, Y, and Z axes, respectively.
Magnetometer readings measure the strength and direction of the magnetic field around subjects, aiding in orientation detection; they do not provide direct information about location. The field strength can likewise be summarized as

$B_{\text{mag}} = \sqrt{B_x^2 + B_y^2 + B_z^2}$

where $B_x$, $B_y$, and $B_z$ are the magnetic field strengths along the X, Y, and Z axes, respectively.
These sensors provide a comprehensive view of a person’s movement and orientation in three-dimensional space, enabling the recognition of various activities and gestures.
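As an illustration of per-window feature extraction from these three modalities, the sketch below computes axis-wise and magnitude statistics; the specific statistics (mean, standard deviation) are our example choice, not necessarily the exact feature set used:

```python
import numpy as np

def magnitude(xyz: np.ndarray) -> np.ndarray:
    """Euclidean magnitude of a (T, 3) array of axis readings."""
    return np.sqrt((xyz ** 2).sum(axis=1))

def window_features(acc: np.ndarray, gyro: np.ndarray, mag: np.ndarray) -> np.ndarray:
    """Illustrative statistical features for one window: per-axis mean and
    standard deviation plus magnitude mean and standard deviation for the
    accelerometer, gyroscope, and magnetometer, concatenated."""
    feats = []
    for xyz in (acc, gyro, mag):
        m = magnitude(xyz)
        feats += [xyz.mean(axis=0), xyz.std(axis=0), np.array([m.mean(), m.std()])]
    return np.concatenate(feats)
```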
3.4. Data Generation
We use data generation techniques to create diverse and comprehensive datasets for HAR. These techniques comprise two primary processes:
Within the sampling layer, we utilize a new method known as the Random Framing (R-Frame) booster. This method improves on typical sliding-window techniques: rather than obtaining data in single slices at a time, R-Frame samples multiple slices over an entire training pass, also referred to as an “epoch”.
In every epoch, n data frames are captured that are closely adjacent to one another. Each frame, which we denote $F_i$, is characterized by a brief sequence of feature attributes and commences from a point $t_i$, covering the next $L$ points, where $L$ is the length of the frame. This dense sampling helps us obtain a better and more detailed picture of the data over time.
By performing denser sampling, the model builds a finer-grained understanding of activities and a more detailed temporal representation, reducing data loss and better handling activity durations that change over time. By enhancing the quality and richness of the sampled data, the R-Frame booster improves the results for HAR.
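A minimal sketch of R-Frame sampling is given below; the frame count n and frame length L are hyperparameters, and the implementation details (uniform random start points within one recording) are our assumptions rather than the exact published procedure:

```python
import numpy as np

def r_frame_sample(series: np.ndarray, n_frames: int, frame_len: int,
                   rng: np.random.Generator) -> np.ndarray:
    """Sample n_frames densely and randomly positioned frames of length
    frame_len from a (T, C) series of T time steps and C channels.
    Returns an array of shape (n_frames, frame_len, C)."""
    T = series.shape[0]
    starts = rng.integers(0, T - frame_len + 1, size=n_frames)
    return np.stack([series[s:s + frame_len] for s in starts])

# One fresh set of frames could be drawn per training epoch, e.g.:
# frames = r_frame_sample(signal, n_frames=8, frame_len=128,
#                         rng=np.random.default_rng(0))
```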
Once we obtain data slices from the sampling layer (R-Frame), we transition to the data augmentation layer, where we apply Mix-up. This method generates novel data by blending samples of the existing dataset.
Consider two original samples from our dataset:

Sample 1: features $x^{(1)}$, label Class A.

Sample 2: features $x^{(2)}$, label Class B.

To create a new virtual sample, Mix-up employs linear interpolation between these samples using a mixing ratio ($\lambda$). For this example, we set $\lambda = 0.7$.

The new features and labels are computed as follows:

$x_{\text{new}} = \lambda x^{(1)} + (1 - \lambda) x^{(2)}$

$y_{\text{new}} = \lambda y^{(1)} + (1 - \lambda) y^{(2)}$

Breaking this down, the new label is an interpolated combination of Class A and Class B weighted by $\lambda$: 0.7 Class A and 0.3 Class B.

Thus, the generated virtual sample has features $x_{\text{new}}$ and a label that is a weighted combination of Class A and Class B (e.g., 0.7 Class A + 0.3 Class B).

This process enables the generation of new samples that blend the original data points, thereby exposing the model to a broader range of feature combinations and label distributions. While both samples share the same feature dimensions, their values differ, allowing the Mix-up method to create new variations that help the model learn more effectively. The choice of $\lambda = 0.7$ gives a slight advantage to Class A, ensuring that the model learns from both classes without being overly biased towards one. This balanced approach helps in developing a more robust model capable of generalizing better across diverse scenarios.
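The worked example above translates directly into code. The sketch below is a generic Mix-up implementation; the feature values are made-up placeholders, and only the interpolation rule itself comes from the text:

```python
import numpy as np

def mixup(x1, y1, x2, y2, lam=0.7):
    """Linearly interpolate two samples and their one-hot labels
    with mixing ratio lam."""
    x_new = lam * x1 + (1.0 - lam) * x2
    y_new = lam * y1 + (1.0 - lam) * y2
    return x_new, y_new

# Class A and Class B as one-hot labels over two classes;
# feature values are illustrative only.
x_a, y_a = np.array([0.2, 1.1, -0.4]), np.array([1.0, 0.0])
x_b, y_b = np.array([0.9, -0.3, 0.5]), np.array([0.0, 1.0])
x_new, y_new = mixup(x_a, y_a, x_b, y_b, lam=0.7)
# y_new == [0.7, 0.3], i.e., 0.7 Class A and 0.3 Class B
```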
3.5. Vector Quantization Using K-Means
To implement vector quantization, we first identify the most appropriate number of clusters for K-means clustering. The process is described as follows:

We use the elbow method to uncover the best number of clusters (K) for each category in the data. This means plotting the Within-Cluster Sum of Squares (WCSS) against the number of clusters and identifying the “elbow point” where the rate of decline in WCSS slows substantially; this point indicates the ideal number of clusters for the data.

By utilizing the elbow technique, the optimal number of clusters for each group is identified, guaranteeing that K-means effectively captures the hidden data patterns without overfitting or underfitting.
Given the chosen number of clusters, the K-means algorithm groups the feature vectors by similarity of their properties. The procedure iterates: each point $x \in X$ is assigned to its closest cluster $c$, after which each cluster centre $\mu_c$ is recomputed as the mean of the points assigned to it.

Vector quantization is achieved by clustering the feature vectors with K-means so that each vector is represented by the centroid of its cluster, against which all points in that group are measured. This minimizes the information loss incurred by this form of dimensionality reduction.

To avoid information loss during vector quantization, the number of clusters per class is guaranteed to lie within a specified range, preventing over-compression of the data and the resulting distortion of information.

A balance between dimensionality reduction and the preservation of data fidelity is achieved by bounding the cluster count, thereby maximizing the efficiency of subsequent classification models.
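A compact sketch of this stage, combining an automated stand-in for visual elbow inspection with K-means quantization (scikit-learn; the elbow heuristic based on the second difference of the WCSS curve is our simplification), might look as follows:

```python
import numpy as np
from sklearn.cluster import KMeans

def elbow_k(features: np.ndarray, k_max: int = 10) -> int:
    """Heuristic elbow pick: the K where the WCSS decline flattens most."""
    wcss = [KMeans(n_clusters=k, n_init=10, random_state=0)
            .fit(features).inertia_ for k in range(1, k_max + 1)]
    drops = np.diff(wcss)                      # successive WCSS decreases
    return int(np.argmax(np.diff(drops))) + 2  # largest flattening -> elbow K

def quantize(features: np.ndarray, k: int) -> np.ndarray:
    """Replace each feature vector with the index of its nearest centroid,
    yielding a discrete symbol sequence."""
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(features)
    return km.predict(features)

# k = elbow_k(class_features); symbols = quantize(class_features, k)
```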
3.6. Conversion to One-Hot Encoded Vector
Once the discrete sequences are obtained via vector quantization with K-means clustering, the next step involves transforming each sequence into a one-hot encoded vector. This process, as well as the rationale behind it, is explained below for clarity.
Each symbol in a sequence obtained from vector quantization corresponds to a cluster centre that best represents the feature vectors of its group; these centres act as a set of discrete labels for the data points. We use one-hot encoding to convert these discrete sequences into a format suitable for input into a CNN model. In this process, each discrete label is represented as a binary vector in which only one element is set to 1 and all others are set to 0. The position of the 1 corresponds to the index of the centroid in the cluster space.
CNN models typically require input vectors in which each dimension carries a meaningful numerical value. Feeding the integer sequences from vector quantization directly into the CNN would be inappropriate, since the model could mistake the arbitrary ordering of centroid indices for a meaningful numerical relationship. Instead, we convert the discrete labels into a numerical representation the CNN can interpret correctly. This modification allows the model to treat each centroid as a separate category, enabling it to accurately describe the relationship between centroids and activity classes, which is crucial for satisfying the CNN model’s input conditions. All input patterns must remain consistent, with each one-hot encoded vector having a length equal to the total number of centroids.
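A minimal sketch of this one-hot conversion (NumPy; the vector length equals the total number of centroids, as stated above):

```python
import numpy as np

def one_hot(symbols: np.ndarray, n_centroids: int) -> np.ndarray:
    """Convert a sequence of centroid indices into one-hot rows of length
    n_centroids, with a single 1 at each step's centroid index."""
    out = np.zeros((len(symbols), n_centroids), dtype=np.float32)
    out[np.arange(len(symbols)), symbols] = 1.0
    return out

# e.g. one_hot(np.array([2, 0, 1]), 4) ->
# [[0, 0, 1, 0],
#  [1, 0, 0, 0],
#  [0, 1, 0, 0]]
```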
3.7. Integration with CNN Models
After converting the discrete sequences to one-hot encoded vectors, the next step is to feed these vectors into a CNN model for classification.
Feeding one-hot encoded vectors into the CNN model [21] enables effective learning and classification of human activities. The model captures the complex patterns and relations that exist in the input data, using the hierarchical feature extraction abilities of CNNs for accurate detection.
5. Experiment and Results
In this section, we discuss the experimental setup, including how the dataset is generated; data preprocessing (the stage preceding network training, which may involve transformations such as z-score normalization); vector quantization (the method used and the typical block size); and the instantiation of the CNN architecture used to classify human activities.
5.1. Implementation
In the first stage of implementation, we assembled a dataset covering different human activities, such as walking, running, sitting, and standing. The data were collected from devices such as accelerometers and gyroscopes worn by participants performing these activities, so the recordings reflect real movements and provide the raw data for subsequent processing.
We determined the optimal number of clusters for each activity class using the elbow method, as presented in Figure 2. By plotting the within-cluster sum of squares (WCSS) against the number of clusters, we identified the point at which the rate of decrease in WCSS declines considerably, signaling the best number of clusters to form the Cluster Size (CS).
Having found the optimal number of clusters with the elbow method, we applied the K-means algorithm to group feature vectors into clusters based on their similarity. For instance, with an optimal value of 5, we form 5 groups of similar feature vectors, and every data point is assigned to one of the cluster centroids. This process yields a lower-dimensional space without losing many key characteristics.
Discrete sequences obtained by K-means clustering are converted into one-hot encoded vectors. Each discrete label is represented as a binary vector where one element is set to 1 and all others are set to 0. One-hot encoding transforms categorical data into a numerical representation suitable for input into the CNN model, enabling consistent dimensionality and effective learning of relationships between discrete labels and activity classes.
In the implementation of our proposed methodology, we used a CNN architecture for human activity recognition tasks. The model has multiple convolutional layers, followed by group normalization, max pooling, and fully connected layers. It was trained on different datasets with different parameters using three convolutional layers and the Adam optimizer. During training, data preprocessing techniques such as null-value removal, duplicate removal, and data augmentation were applied to enhance the diversity and quality of the training dataset. The model was trained for 10 epochs, with performance metrics monitored throughout the training process. After training, the model showed stable behavior with satisfactory accuracy and loss curves.
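A sketch of this training configuration is shown below (PyTorch). The 10 epochs and Adam optimizer come from the text; the learning rate of 1e-3 is an assumed placeholder, since the exact value is not given here, and `model` and `train_loader` are assumed to exist:

```python
import torch
from torch import nn, optim

def train(model: nn.Module, train_loader, epochs: int = 10) -> None:
    criterion = nn.CrossEntropyLoss()
    optimizer = optim.Adam(model.parameters(), lr=1e-3)  # assumed learning rate
    model.train()
    for epoch in range(epochs):
        total = 0.0
        for x, y in train_loader:
            optimizer.zero_grad()
            loss = criterion(model(x), y)
            loss.backward()
            optimizer.step()
            total += loss.item()
        print(f"epoch {epoch + 1}: loss={total / len(train_loader):.4f}")
```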
5.2. CNN Model Implementation
Our CNN model is designed to capture complex patterns in the input data through a series of convolutional and pooling layers, followed by fully connected layers. The architecture is shown in Figure 3 and described below.
The first convolutional layer plays an important role in extracting low-level features from the input data. This layer applies 256 filters to capture basic patterns such as edges and textures. Group normalization with four groups aids the learning process by normalizing the activations within each group. Subsequently, the feature map is subsampled using max pooling, reducing its dimensions and thus both the complexity and the computational cost. Complex patterns in this model are captured using the non-linearity of Rectified Linear Units (ReLUs).
The second convolutional layer builds on the features learned by the previous layer, refining them and capturing more complex patterns and relationships in the data. It uses the same configuration as the first layer (256 filters, group normalization, max pooling, and ReLU activation), continuing the hierarchical extraction of features.
The third and final convolutional layer further refines the recognized patterns. It employs the same settings as above, with 256 filters applied for feature mapping, followed by group normalization and activation. By progressively examining the input data in this way, the network can extract the intricate patterns necessary for correct identification.
To reduce overfitting and improve the model’s generalization capability, a dropout layer is inserted before the fully connected layers. With a dropout probability of 0.5, this layer randomly deactivates half of the neurons during training, forcing the model to learn robust features that are invariant to small variations in the input.
After feature extraction, the model transitions to the classification stage with fully connected layers. The first fully connected layer reduces the dimensionality of the feature space, projecting the extracted features into a lower-dimensional representation. With 8960 input features and 128 output features, this layer facilitates the transformation of abstract features into interpretable representations.
Following the first fully connected layer, another ReLU activation function introduces non-linearity, enabling the model to learn complex mappings between features and activity classes. Dropout regularization with the same probability is applied again to enhance the model’s resilience to overfitting.
The final fully connected layer serves as the output layer of the model, mapping the lower dimensional feature representation to the output space. With 128 input features and 12 output features (corresponding to the number of activity classes that differ depending on the dataset), this layer computes the probabilities of each class using a softmax activation function. The class with the highest probability is predicted as the output.
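For concreteness, a PyTorch sketch consistent with this description follows. The kernel size (3), pooling size (2), input length (280), and number of input channels are our assumptions; they are chosen so that three pooling stages leave 256 × 35 = 8960 flattened features, matching the stated fully connected layer, and the softmax is left to the loss function:

```python
import torch
from torch import nn

class HARCNN(nn.Module):
    """Three conv blocks (256 filters, 4-group GroupNorm, max pool, ReLU),
    dropout 0.5, then fully connected layers 8960 -> 128 -> 12."""
    def __init__(self, in_channels: int, n_classes: int = 12):
        super().__init__()
        def block(c_in: int) -> nn.Sequential:
            return nn.Sequential(
                nn.Conv1d(c_in, 256, kernel_size=3, padding=1),  # kernel size assumed
                nn.GroupNorm(4, 256),
                nn.MaxPool1d(2),                                 # pool size assumed
                nn.ReLU(),
            )
        self.features = nn.Sequential(block(in_channels), block(256), block(256))
        self.classifier = nn.Sequential(
            nn.Dropout(0.5),
            nn.Flatten(),
            nn.Linear(8960, 128),   # 256 channels x 35 steps after three poolings
            nn.ReLU(),
            nn.Dropout(0.5),
            nn.Linear(128, n_classes),  # softmax applied via CrossEntropyLoss
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.classifier(self.features(x))

# Assumed shapes: batch of one-hot sequences with 64 centroids, length 280.
# logits = HARCNN(in_channels=64)(torch.randn(8, 64, 280))
```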
5.3. Results
The performance of the proposed model is evaluated using accuracy and the F1 score. Table 1 summarizes the results compared with the techniques proposed in [3,4,35].
The performance of the model on the OPP79 dataset is evaluated using the F1 score shown in Figure 4 and the loss curve shown in Figure 5.
The performance of the model on the PAMAP2 dataset is evaluated using the F1 score shown in Figure 6 and the loss curve shown in Figure 7.
We tested the WISDM dataset with different learning rates, but the F1 score stayed at 100% every time. This is because the dataset contains clean and well-labeled sensor data from both smartphones and smartwatches, collected from 51 subjects performing 18 different activities. The activities are distinct and varied, making it easy for models to recognize patterns. Additionally, the sliding window technique used to process the time-series data helps improve model performance. Previous research has also shown similarly high accuracy with this dataset [36,37].
Figure 8 and Figure 9 depict curves generated to examine the impact of learning rates on the accuracy and loss of the model.