1. Introduction
Data from the World Health Organization show that approximately six hundred thousand fatal falls occur each year, making falls the second leading cause of injury-related death after traffic accidents [1]. One-third of the aging population falls each year. US statistics show that about 36 million falls occur annually among older adults and that more than one-quarter (28%) of people aged 65 and above report a fall each year [2]. Of these falls, 37% (about 8 million) result in an injury that requires medical treatment [2]. For these reasons, there is a national imperative to develop cost-effective, real-time fall detection with new sensor technologies and methods.
Fall-detection systems differentiate falls from non-fall activities so that an alert can be sent automatically to a remote monitoring point as soon as a patient falls, or so that a protective airbag can be deployed if the detection is sufficiently fast [3]. Recently, several methods have been proposed using different sensors with varying performance levels [4,5,6]. The most common sensor technologies used for fall recognition fall into three classes: camera sensors, infrared sensors, and wearable sensors. Infrared and camera-based sensors are expensive, and because they record audiovisual signals, they raise patient-privacy concerns and confine monitoring to fixed locations. Given these limitations, wearable sensor-based fall detection offers a cheaper alternative: one or more sensors are attached to the user's body or clothing and can therefore be carried everywhere.
The analysis of signals obtained from wearable sensors mounted on the human body is commonly used to monitor the health status of older patients with motion-assistive devices [7]. These sensors typically generate complex hip-motion signals that are difficult to interpret without expert intervention. A computationally efficient fall-detection modeling technique is therefore required to provide a meaningful characterization of the sensor data, which could be leveraged to trigger the pneumatic actuation of a proactive airbag. Such a technique automatically analyzes sensor readings to infer the kind of human activity performed by a user. Many researchers have developed supervised fall-detection methods over the last decades [4,5,6]. However, fall datasets that contain near-fall activities, which are the main cause of false positives and false negatives, remain difficult to obtain.
The most common methods for fall-detection systems that use wearable sensors involve thresholding or machine learning techniques [4]. In the simple threshold fall-detection method, threshold values are computed for either the magnitude or the vertical projection of the acceleration [8,9]. When either value exceeds its specified threshold, a fall is detected. In a related work, Viet uses an upper threshold on the acceleration magnitude [10]. If the upper threshold is exceeded in less than 1 s, a fall template built with a wavelet transform is compared with the acceleration signal, and a fall is detected if the comparison produces a high similarity value. Although automatic fall detection using a threshold-based method on individual parameters calculated from accelerometry measurements has a high sensitivity, it has a relatively low specificity [11]. For example, Purwar used a triaxial accelerometer and set the acceleration and torso-orientation thresholds experimentally to detect falls, achieving an accuracy of 81% [12].
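To make the thresholding idea concrete, the following minimal Python sketch flags a fall when the acceleration magnitude exceeds a fixed threshold; the threshold value is illustrative and not taken from the cited works.

```python
import numpy as np

ACC_THRESHOLD = 2.5  # illustrative value in g; the cited works tune this experimentally

def detect_fall_threshold(acc_xyz: np.ndarray, threshold: float = ACC_THRESHOLD) -> bool:
    """Flag a fall if the acceleration magnitude exceeds the threshold.

    acc_xyz: array of shape (n_samples, 3) with x, y, z acceleration in g.
    """
    magnitude = np.linalg.norm(acc_xyz, axis=1)  # sqrt(ax^2 + ay^2 + az^2)
    return bool(np.any(magnitude > threshold))
```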
Montesinos [5] and Klenk [6] have thoroughly compared the performance of basic thresholding techniques with a wide set of supervised learning solutions. Threshold techniques are difficult to generalize because the upper threshold separating fall from non-fall activities must be fixed in advance and may not transfer to new test subjects. Due to this limitation, recent approaches [13] employ feature-extraction engineering and machine learning classifiers to improve detection accuracy.
In the last decade, many fall-detection methods have been developed using classical machine learning techniques [14] and deep learning [15,16,17]. Some of the developed fall-detection systems use logistic regression [14], naive Bayes [18,19], decision trees [18,19], support vector machines [18], and k-nearest neighbors [18,19,20]. Deep learning algorithms have also been applied to gait analysis and fall-detection systems. Other methods detect falls from inertial sensor signals by converting the 1D sensor signals into 2D images of spectrotemporal footprints and then applying a CNN to classify the converted images [21,22].
Because wearable sensor-based fall detection generates sequential data, it lends itself to sequential models such as LSTM. The sensor fusion of accelerometer and gyroscope data streams using a hybrid CNN-LSTM method [15], as well as LSTM-based activity recognition [16,23], has been used for the recognition of falls. The study in [16] uses a single inertial sensor placed on the trunk; the CNN and LSTM model inputs are the raw acceleration and angular velocity signals. The problem with Ruben's approach [15] is its use of KNN classifiers; such algorithms are challenging to apply to real-time fall detection because of the computational cost of the Euclidean distance.
Long short-term memory (LSTM) networks and transformers are popular sequence models used to encode and process sequential data. In this article, we extend our previous works [17,24] by proposing a new technique for fall-detection systems. Our approach is based on a vision transformer [25,26,27] and a deep LSTM [28], architectures originally developed for image recognition and computer vision. First, we collected fall datasets; we then pre-processed and segmented the data and balanced the minority-class samples. Fall kinematics data have different lengths, and it is difficult to include the fall event in a single-window input sequence. Thus, we introduce different sliding-window segmentation methods for the input sequence that help us to develop a real-time fall-detection system.
Recently, many researchers have been developing vision transformer (ViT) [27] and deep LSTM [28] methods for image classification. This paper explores convolutional feature extractors combined with deep learning models for fall detection on wearable sensor datasets. Two deep learning models serve as our imminent-fall detection model: the LSTM encoder and the transformer encoder. To balance the minority classes, several techniques are used: resampling, data augmentation, and customized loss functions. We collect data using an accelerometer and a gyroscope to detect an impending fall correctly while reducing false positives and false negatives, and we include near-fall activities in the dataset. We also propose several sliding-window segmentation techniques that capture the important signal information within a single observation window across the various features of the sequential data.
2. Materials and Methods
Two competing deep learning methods are used to distinguish falls from non-fall activities. We use wearable sensor signals to train and test the two models, the LSTM and transformer encoders. First, we present the collected fall datasets; we then explain the required pre-processing, sliding-window segmentation, and dataset balancing. Lastly, the different deep learning models are explained.
2.1. Data
One of us (CP) conducted fall experiments on human subjects in the Neuromechanics and Neuroplasticity Laboratory of San Diego State University. He collected the fall dataset at a sampling rate of 200 Hz [29] using accelerometers mounted on different parts of the subject's body. The laboratory is equipped with wireless 3D motion-capture cameras that record human subject movements [30]. The data were collected from sixteen subjects between the ages of 20 and 50 years, two of whom are female. The Noraxon myoMOTION research inertial measurement unit (IMU) sensors measure features such as acceleration and angular velocity [29]. The dataset contains activities such as near-falls, backward and forward falls, obstacle falls, and ADL (activities of daily living).
2.2. Data Pre-Processing
First, the datasets are pre-processed to remove high-frequency noise. The signal noise comes from the sensor itself or from the movement of the body. To remove this noise, we use a first-order low-pass Butterworth filter [31,32]. Although a second-order Butterworth filter provides better results, the first-order low-pass filter is selected because it requires fewer computations and is sufficient to remove the noise from fall-activity-related signals.
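A minimal sketch of this filtering step using SciPy is shown below; the 5 Hz cutoff is an assumption for illustration, as the cutoff frequency used in our pipeline is not stated here.

```python
import numpy as np
from scipy.signal import butter, filtfilt

FS = 200.0      # sampling rate (Hz), as reported for the dataset
CUTOFF = 5.0    # assumed cutoff frequency (Hz), for illustration only

# First-order low-pass Butterworth filter coefficients
b, a = butter(N=1, Wn=CUTOFF / (FS / 2), btype="low")

def lowpass(signal: np.ndarray) -> np.ndarray:
    """Zero-phase low-pass filtering of one sensor channel."""
    return filtfilt(b, a, signal)
```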
Feature normalization is a data pre-processing method used to map two or more sets of feature values with disparate ranges to a common range. The accelerometer and gyroscope report measurements in different units (e.g., acceleration versus revolutions per second), so their raw ranges are unlikely to be equivalent. Furthermore, feature normalization enables our model to converge rapidly while ensuring that each feature contributes equally [33,34,35]. Thus, it is necessary to normalize the sensor measurements selected as inputs to our model, and the way the features are normalized plays an important role in the overall performance of our algorithm. In our approach, the data points for each feature are normalized using the minimum and maximum of that feature. The feature-normalization pre-processing is performed on the 3D acceleration signals $(a_x, a_y, a_z)$ and 3D gyroscope signals $(\omega_x, \omega_y, \omega_z)$. The general formula is given as follows:

$$\widehat{Feat} = \frac{Feat - \min(Feat)}{\max(Feat) - \min(Feat)}$$

where $Feat$ is the vector of features x, y, and z of the accelerometer and gyroscope, and $\widehat{Feat}$ is the normalized feature vector calculated using the maximum and minimum values of the vector of features.
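The normalization formula translates directly into a per-channel NumPy routine, sketched below.

```python
import numpy as np

def min_max_normalize(feat: np.ndarray) -> np.ndarray:
    """Min-max normalization of one feature channel to [0, 1], per the formula above."""
    return (feat - feat.min()) / (feat.max() - feat.min())

# Applied per channel to the six features (ax, ay, az, wx, wy, wz):
# data has shape (n_samples, 6)
# normalized = np.apply_along_axis(min_max_normalize, 0, data)
```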
2.3. Sliding-Window Segmentation Methods
The collected dataset has variable temporal durations of the sequences, since the data collection period for each subject is different. Although most research papers [15,16,36] use sliding-window segments longer than 2 s, real-time fall-detection applications should respond in under 0.2 s. Therefore, we implement competing sliding-window segmentation techniques to detect falls in near-real time, anticipating that improvements in hardware efficiency will ultimately compress this response time further. We compare three methods of window segmentation: the conventional sliding window, the peak-detection sliding window, and dynamic multi-window sliding.
2.3.1. Conventional Sliding Window
The segmentation of signal data plays a critical role in signal recognition and prediction. A sliding-window approach segments the continuous time-series dataset into short segments. Many researchers [15,16,36] use the sliding-window approach to engineer the features of fall datasets. There are two methods of sliding-window segmentation for fall detection: fixed-size non-overlapping windows and fixed-size overlapping windows. We use fixed-size overlapping sliding windows because the overlap enables us to generate more data. The fixed-size overlapping sliding-window technique processes the signal data along the temporal axis based on the window size and stride (step) size. Here, we segment the signal into 0.2 s windows for each label using a fixed-size overlapping sliding window with an overlap of 0.1 s between consecutive windows. If even a single sample within a 0.2 s window is labeled as a fall, the whole window is labeled a fall.
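The following Python sketch implements this fixed-size overlapping segmentation with the stated 0.2 s window and 0.1 s stride.

```python
import numpy as np

FS = 200                  # sampling rate (Hz)
WINDOW = int(0.2 * FS)    # 0.2 s window -> 40 samples
STRIDE = int(0.1 * FS)    # 0.1 s overlap -> stride of 20 samples

def sliding_windows(signal: np.ndarray, labels: np.ndarray):
    """Fixed-size overlapping sliding-window segmentation.

    signal: (n_samples, 6) concatenated accelerometer/gyroscope channels.
    labels: (n_samples,) per-sample fall (1) / non-fall (0) labels.
    A window is labeled a fall if it contains at least one fall sample.
    """
    windows, window_labels = [], []
    for start in range(0, len(signal) - WINDOW + 1, STRIDE):
        end = start + WINDOW
        windows.append(signal[start:end])
        window_labels.append(int(labels[start:end].any()))
    return np.stack(windows), np.array(window_labels)
```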
2.3.2. Peak-Detection Windowing
The human gait produces peak acceleration and angular velocity signals when fall and near-fall activities occur. Hence, accurately detecting falls of senior subjects requires us to differentiate the peak signals of near-falls from those of fall events. Because the signal values fluctuate, we take the highest acceleration values as the basis for defining the detector's observation window.
Therefore, to exploit these properties, we start the sliding-window feature engineering from the maximum magnitude of the acceleration signal (as shown in Algorithm 1) rather than using conventional sliding-window feature engineering. This kind of windowing helps us include the important peak signal value in a single window while paying particular attention to the maximum-value signal; with conventional windowing based on the raw signal, the important information may be split across different windows. For our proposed method, the maximum values of the acceleration and angular velocity in the accelerometer and gyroscope datasets, consisting of 3D acceleration data $(a_x, a_y, a_z)$ and 3D angular velocity $(\omega_x, \omega_y, \omega_z)$ [22], are calculated using the magnitudes of these vectors for the $i$-th sample as follows:

$$A_i = \sqrt{a_{x,i}^2 + a_{y,i}^2 + a_{z,i}^2}, \qquad \Omega_i = \sqrt{\omega_{x,i}^2 + \omega_{y,i}^2 + \omega_{z,i}^2}$$

Next, the maximum magnitude of the signal calculated above is determined as follows:

$$A_{\max} = \max_{1 \le i \le M} A_i$$

where M is the number of samples. Using the above two equations and the size of the window (0.2 s in our case), the sliding window around the peak signal is formed as shown in Figure 1.
Algorithm 1: Calculating the maximum value and labeling the window.
After we estimate the maximum acceleration and decide on the size of the sliding window, the algorithm above gives us the input features and their labels, which are formed by simply concatenating the six features of the accelerometer and gyroscope, $X \in \mathbb{R}^{(w f_s) \times 6}$, where $w$ is the duration of the observation window (0.2 s) and $f_s$ is the sampling rate of the sensor (200 Hz) [22]. Finally, the size of the input features ($X$), which depends on the duration of the observation window ($w$), is $(0.2 \times 200) \times 6 = 40 \times 6$.
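As a concrete illustration of Algorithm 1, the following Python sketch centers the observation window on the sample with the maximum acceleration magnitude; the exact centering and boundary handling are our assumptions, since the pseudocode listing is not reproduced here.

```python
import numpy as np

FS = 200
WINDOW = int(0.2 * FS)  # 40 samples

def peak_window(signal: np.ndarray, labels: np.ndarray):
    """Peak-detection windowing: one window around the acceleration peak.

    signal: (n_samples, 6) with columns (ax, ay, az, wx, wy, wz).
    """
    magnitude = np.linalg.norm(signal[:, :3], axis=1)   # A_i per sample
    peak = int(np.argmax(magnitude))                    # index of A_max
    start = max(0, min(peak - WINDOW // 2, len(signal) - WINDOW))
    end = start + WINDOW
    window = signal[start:end]                          # shape (40, 6)
    label = int(labels[start:end].any())                # fall if any sample is a fall
    return window, label
```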
2.3.3. Sliding Multi-Window
In contrast to the two methods above, the multi-window segmentation approach uses a dynamic main window based on the labels of the dataset and slides conventional sub-windows inside the main window. We use fixed-size overlapping sliding windows for the sub-windows inside the label-based main window. Using the main window defined by the dataset labels and the sub-window size (0.2 s), the sliding-window procedure is given in Algorithm 2, and the observation window is extracted as shown in Figure 2.
Algorithm 2: Sliding multi-window.
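A minimal sketch of Algorithm 2 follows, under the assumption that each main window corresponds to one contiguous run of identical labels; the sub-window and stride sizes match Section 2.3.1.

```python
import numpy as np
from itertools import groupby

def multi_window(signal: np.ndarray, labels: np.ndarray,
                 window: int = 40, stride: int = 20):
    """Dynamic main window per contiguous label run, with fixed sub-windows inside."""
    windows, window_labels = [], []
    start = 0
    for label, run in groupby(labels):
        length = len(list(run))
        end = start + length                      # main window = one label run
        for s in range(start, end - window + 1, stride):
            windows.append(signal[s:s + window])  # fixed-size sub-window
            window_labels.append(int(label))
        start = end
    return np.stack(windows), np.array(window_labels)
```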
2.4. Handling Imbalanced Datasets
In our collected dataset, 13% of the samples carry the fall label and 87% correspond to non-fall activities, so the data are highly imbalanced. If such imbalance is ignored, deep learning models perform poorly on the minority classes, even though the accuracy on those minority classes is the most important. We present different methods for balancing the datasets: resampling techniques, data augmentation, and a customized loss function. These three methods are compared for balancing our training datasets so that the deep learning methods do not suffer on the minor classes.
2.4.1. Resampling Techniques
The main objective of balancing datasets is to increase the number of minority-class samples or decrease the number of majority-class samples in order to obtain an equivalent frequency of instances for both classes. The most familiar resampling approaches for resolving the imbalanced-dataset problem are oversampling the minority class and undersampling the majority class. The best-known undersampling method, random undersampling, randomly deletes window signals of the majority classes from the training dataset. The most common and simplest oversampling method, random oversampling, simply duplicates random sequences of window signals from the minority class in the training dataset. Undersampling improves run-time and storage requirements by reducing the number of training samples, but it discards some information; oversampling loses no data but increases the likelihood of overfitting because it replicates minority-class events.
Thus, these duplicated examples do not provide new information to the model. For this reason, we also compare another method, the Synthetic Minority Oversampling Technique (SMOTE) [37], with random undersampling and random oversampling. This approach to balancing datasets helps us avoid the overfitting that occurs when exact replicas of minority instances are added to the main dataset. A subset of data is taken from the minority class as an example, and new similar synthetic instances are created; these synthetic instances are then added to the original dataset, and the resulting dataset is used to train the classification models. SMOTE works by selecting a random example from the minority class, finding its k nearest minority-class neighbors, and generating a synthetic example at a randomly chosen point along the line segment in feature space between the example and one of those neighbors [37].
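As an illustration, SMOTE from the imbalanced-learn library can be applied to the segmented windows; flattening each window before resampling is our assumption, since SMOTE operates on 2D feature matrices.

```python
import numpy as np
from imblearn.over_sampling import SMOTE

def smote_balance(windows: np.ndarray, labels: np.ndarray):
    """windows: (n_windows, 40, 6); labels: (n_windows,)."""
    n, t, c = windows.shape
    flat = windows.reshape(n, t * c)                    # flatten each window
    flat_res, labels_res = SMOTE(k_neighbors=5).fit_resample(flat, labels)
    return flat_res.reshape(-1, t, c), labels_res       # restore window shape
```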
2.4.2. Data Augmentation
A variety of transformations can be applied to data in order to augment it with prior knowledge about its invariant properties. By augmenting the input data, DL models can better cover the unexplored input space, prevent overfitting, and improve generalization [38]. In image recognition, it is well known that minor changes due to jittering, scaling, cropping, warping, or rotating do not affect the data labels. Wearable sensor data can also be transformed in a label-preserving way, but the appropriate transformations are less intuitive [39].
Rotation: Rotational data augmentation is applied to wearable sensor data to introduce label-invariant variability [39]. When the sensor is placed upside down, the readings can be inverted without changing the labels. As a result, 180-degree rotations can be used to simulate upside-down sensor placements when augmenting the existing data. Based on the acceleration vector $a_t = (a_{x,t}, a_{y,t}, a_{z,t})$ for time $t$, which contains the acceleration components along the x, y, and z axes, respectively, a new vector $a'_t = (-a_{x,t}, -a_{y,t}, -a_{z,t})$ can be obtained by rotating $a_t$ 180 degrees about the x, y, and z axes.
MixUp: MixUp augmentation is a technique that combines two data examples to augment the data in a more meaningful way [40]. In temporal MixUp, all values of a set of features in a first input sequence (observation window) of the sensor are multiplied by a random value m and then added to all features of the same sensor from a second, randomly selected input sequence multiplied by (1 - m). Generally, m is selected at random between 0 and 1. As a result of this operation, all six features of the accelerometer and gyroscope are affected.
CutMix: The CutMix technique also combines two data samples [41]. Temporal CutMix selects a random time segment from a first input sequence (observation window) and the corresponding time segment from a randomly chosen second input sequence; both segments start at the same time step and end at the same time step. CutMix then selects a random set of sensor channels and replaces the channel values of the first multi-feature segment with the channel values of the second multi-feature segment. The size of the random time segment is chosen at random between minimum and maximum hyperparameter values, and the channel probability is a hyperparameter that represents the chance that each channel is selected for this operation.
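A minimal sketch of the three augmentations follows; the (0, 1) mixing range for m, the segment-length bounds, and the 0.5 channel probability are assumed hyperparameters.

```python
import numpy as np

rng = np.random.default_rng(0)

def rotate_180(x: np.ndarray) -> np.ndarray:
    """180-degree rotation: negate all channels to simulate an
    upside-down sensor placement (label-preserving)."""
    return -x

def mixup(x1: np.ndarray, x2: np.ndarray) -> np.ndarray:
    """Temporal MixUp of two observation windows with a random
    mixing coefficient m in (0, 1) (assumed range)."""
    m = rng.uniform(0.0, 1.0)
    return m * x1 + (1.0 - m) * x2

def cutmix(x1: np.ndarray, x2: np.ndarray,
           min_len: int = 5, max_len: int = 20) -> np.ndarray:
    """Temporal CutMix: replace a random time segment of randomly chosen
    channels in x1 with the aligned segment from x2."""
    out = x1.copy()
    seg = rng.integers(min_len, max_len + 1)                 # random segment length
    start = rng.integers(0, x1.shape[0] - seg + 1)           # aligned start index
    channels = rng.random(x1.shape[1]) < 0.5                 # assumed channel probability
    out[start:start + seg, channels] = x2[start:start + seg, channels]
    return out
```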
2.4.3. Customized Loss Function
Lastly, we use a customized loss function to handle the imbalanced classes. To train our fall dataset, we customized the focal loss presented by Lin et al. [
42] that is used for object detection. We use our modified loss function used to tune our models parameters while training our fall dataset. Even though the wrongly classified samples are penalized more than the correct ones, in the fall-detection settings, due to the imbalanced sample size, the loss function is overwhelmed with non-fall activity (ADL) classes. Customized focal loss addresses this problem and is designed in such a way that it reduces the loss for the ADL classes, and thus, the network can focus on training the fall classes. During supervised training of the dataset, our deep learning model is optimized end-to-end using a customized weighted focal loss in Equation (6):
where
and
is the number of non-fall classes,
is the number of fall classes, and
is a focusing parameter whose value is
. This focusing parameter is tailored to reduce the influence of higher-confidence classified samples of ADL classes in the loss. The higher the
, the higher the rate at which easy-to-classify examples are down-weighted. If
, a weighted focal loss is equivalent to a weighted binary cross-entropy loss.
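A hedged PyTorch sketch of such a weighted binary focal loss follows; the alpha and gamma values are placeholders, not the values used in our experiments.

```python
import torch
import torch.nn.functional as F

def weighted_focal_loss(logits: torch.Tensor, targets: torch.Tensor,
                        alpha: float = 0.87, gamma: float = 2.0) -> torch.Tensor:
    """Weighted binary focal loss in the spirit of Lin et al.

    logits, targets: same shape; targets are floats in {0, 1} (1 = fall).
    alpha: assumed class weight (e.g., the non-fall fraction of the data).
    gamma: assumed focusing parameter; gamma = 0 recovers weighted BCE.
    """
    p = torch.sigmoid(logits)
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p_t = targets * p + (1 - targets) * (1 - p)              # prob. of the true class
    alpha_t = targets * alpha + (1 - targets) * (1 - alpha)  # per-sample class weight
    return (alpha_t * (1 - p_t) ** gamma * ce).mean()
```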
2.5. Sensor Positions and Sensor Types
Many fall-detection studies [43] utilize an accelerometer as the primary sensor to determine falls. Chernbumroong [43] evaluates the importance of different sensors in a multisensory scheme, concluding that the most relevant features are derived from two particular acceleration components (Z and Y). A gyroscope is often considered computationally expensive and less useful for improving our understanding of the dynamics of falling [4,44,45]. In contrast, Nguyen et al. [46] report that a sensing module (including a gyroscope and an accelerometer) consumes only about 3% more power with both sensors active than when only the acceleration magnitudes are measured with a single activated sensor.
Furthermore, the choice of sensors to use depends on the type of dataset (whether it contains near-fall activities or not). Using only acceleration measurements can result in many false positives and false negatives caused by near-fall activities, such as sitting down fast on a mattress. Near-fall and fall activities have almost the same vertical signal variation, making them difficult to differentiate. False positives and false negatives caused by near-fall activities can be reduced significantly by using the gyroscope’s angular velocity measurements. Our paper compares an accelerometer-based algorithm and a gyroscope-based algorithm intended for fall detection. The results indicate that the performance of the algorithm that uses gyroscope signals is more effective when the dataset contains near-fall activities.
We collected data from sixteen sensor positions: the head, pelvis, upper and lower thoracic spine, and the left and right upper arms, forearms (wrists), hands, thighs, shanks, and feet. Most studies in the fall-recognition literature use a single wearable device without taking into account the device's location on the user's body. The position information provided by the wearable device can, however, help improve the performance of the fall-detection system. Several body locations are compared in this study to determine the optimal sensor placement for fall-detection systems.
2.6. Proposed Model Architecture
As shown in Figure 3, the proposed overall model architecture contains three parts: a feature extractor, an LSTM or transformer encoder block, and a linear classifier. In the feature-extractor part, three convolutions with different filter sizes are used to extract features from the segmented input sequence. The extracted features and the input sensor signal are concatenated to obtain 4N channels, where N is the number of features of the input signals. In the next block, we use two different types of models to capture the dependencies among the input sensor signals: the first has three stacked bidirectional LSTM encoders, and the second has three stacked transformer encoders. Each LSTM or transformer encoder layer is followed by layer normalization and a GELU activation function. A global average pooling layer and a dense layer are applied to the output of the last LSTM or transformer encoder block. Finally, a sigmoid classifier is used to classify falls against non-fall activities.
2.6.1. Feature Extractor
In the feature-extractor block, kernels (filters) project the input features onto another meaningful dimension to produce hidden feature maps. Using hidden feature maps as the input for the following layers, the network can accurately extract the semantic meaning of the original input. With a convolutional layer, the kernels scan multiple cells at once, revealing hidden relationships between them. Convolutional layers are often used because of their ability to aggregate adjacent cells and their faster inference times compared to RNNs.
The Conv1D layers extract features from the one-dimensional sequences of acceleration and gyroscope data. The model learns to extract features from sequences of observations and to map the internal features to the LSTM or transformer encoder. To extract dense feature representations of the input signals for effective learning, three different feature representations with varying filter sizes are computed.
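A sketch of this three-branch Conv1D extractor in PyTorch follows; the kernel sizes (3, 5, 7) and per-branch channel counts are assumptions, chosen so that concatenating the raw input with the three branches yields 4N channels.

```python
import torch
import torch.nn as nn

class FeatureExtractor(nn.Module):
    """Three Conv1d branches with different kernel sizes, concatenated
    with the raw input to give 4N channels (N = 6 input features)."""
    def __init__(self, n_features: int = 6, kernel_sizes=(3, 5, 7)):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Conv1d(n_features, n_features, k, padding=k // 2)
            for k in kernel_sizes
        ])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, N); Conv1d expects (batch, N, time)
        x = x.transpose(1, 2)
        feats = [branch(x) for branch in self.branches]
        out = torch.cat([x] + feats, dim=1)   # (batch, 4N, time)
        return out.transpose(1, 2)            # (batch, time, 4N)
```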
2.6.2. LSTM Encoder
For sequence modeling, long short-term memory (LSTM) has been a highly successful RNN architecture. When time is a relevant feature, the easiest way to handle it is to concatenate the time features with the input and use the LSTM model. LSTMs and their bidirectional variants are popular because their gates let them learn what to remember and what to forget.
When sequential data are processed, the plain LSTM ignores future information. Building on LSTM, BiLSTM processes the series both forward and backward, connecting the two hidden layers to the same output layer [47] and storing both previous and subsequent information as context for the current time step [48,49]. Accordingly, bidirectional LSTM is more accurate than unidirectional LSTM. In BiLSTM [50], the hidden-layer output includes both forward and backward activation outputs:

$$\overrightarrow{h}_t = \sigma\left(W_{\overrightarrow{h}} x_t + U_{\overrightarrow{h}} \overrightarrow{h}_{t-1} + b_{\overrightarrow{h}}\right)$$

$$\overleftarrow{h}_t = \sigma\left(W_{\overleftarrow{h}} x_t + U_{\overleftarrow{h}} \overleftarrow{h}_{t+1} + b_{\overleftarrow{h}}\right)$$

$$y_t = W_y \left[\overrightarrow{h}_t ; \overleftarrow{h}_t\right] + b_y$$

where W, U, and b are weight matrices and bias vectors, t is the current time step of the recurrent network, $\sigma$ is the activation function, $x_t$ is the input of the hidden layer, and the output is generated by updating the forward structure $\overrightarrow{h}_t$ and the backward structure $\overleftarrow{h}_t$.
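A minimal PyTorch sketch of one such BiLSTM encoder block, with the hidden size as an assumed hyperparameter:

```python
import torch
import torch.nn as nn

class BiLSTMEncoder(nn.Module):
    """One bidirectional LSTM encoder block followed by layer
    normalization and GELU, as described above."""
    def __init__(self, input_size: int = 24, hidden_size: int = 64):
        super().__init__()
        self.lstm = nn.LSTM(input_size, hidden_size,
                            batch_first=True, bidirectional=True)
        self.norm = nn.LayerNorm(2 * hidden_size)  # forward + backward outputs
        self.act = nn.GELU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out, _ = self.lstm(x)                      # (batch, time, 2*hidden)
        return self.act(self.norm(out))
```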
2.6.3. Transformer Encoder
For the transformer model to function properly, we need to attach the notion of time to our input features. In the original NLP model, a collection of superimposed sinusoidal functions was added to each input embedding. We require a different representation because our inputs are scalar values rather than distinct words (tokens). After we feed the concatenated input features to our transformer encoder, we encode the time sequence (time embedding) that is implicit in our signal data. We implement the existing Time2Vec method [50] for time embedding. This time embedding is a vector representation, just like a normal embedding layer, that can be added to the neural network architecture to improve model performance and overcome the transformer's indifference to temporal order. The mathematical representation of Time2Vec, shown in Equation (9), captures both periodic and non-periodic patterns as well as invariance to time rescaling [50]:

$$\mathbf{t2v}(\tau)[i] = \begin{cases} \omega_i \tau + \varphi_i, & i = 0 \\ \mathcal{F}(\omega_i \tau + \varphi_i), & 1 \le i \le k \end{cases}$$

where $\mathcal{F}$ is a periodic function (e.g., sine) and $\omega_i$ and $\varphi_i$ are learnable parameters.
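A compact PyTorch sketch of this Time2Vec layer, with sine assumed as the periodic function and k as an assumed embedding size:

```python
import torch
import torch.nn as nn

class Time2Vec(nn.Module):
    """Time2Vec per Equation (9): one linear (non-periodic) component plus
    k sinusoidal components with learnable frequencies and phases."""
    def __init__(self, k: int = 8):
        super().__init__()
        self.w0 = nn.Parameter(torch.randn(1))
        self.b0 = nn.Parameter(torch.randn(1))
        self.w = nn.Parameter(torch.randn(k))
        self.b = nn.Parameter(torch.randn(k))

    def forward(self, t: torch.Tensor) -> torch.Tensor:
        # t: (batch, time, 1) scalar time steps
        linear = self.w0 * t + self.b0               # non-periodic term
        periodic = torch.sin(self.w * t + self.b)    # periodic terms
        return torch.cat([linear, periodic], dim=-1) # (batch, time, k + 1)
```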
In this paper, we use the transformer encoder based on the ViT [27] developed for image recognition. Vaswani et al. proposed the multi-head attention (transformer) architecture [51]. A multi-head attention layer concatenates the attention weights of n single-head attention layers and then applies a nonlinear transformation with a dense layer. Having the output of n single-head layers allows the model to encode multiple independent single-head transformations, so the model can attend to multiple time-series steps at once. Increasing the number of attention heads positively affects the model's ability to capture long-distance dependencies [52].
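For illustration, three stacked transformer encoder blocks can be sketched with PyTorch's built-in layers; the model width, head count, and feed-forward size below are assumptions, not our tuned values.

```python
import torch
import torch.nn as nn

encoder_layer = nn.TransformerEncoderLayer(
    d_model=32,            # assumed model width (e.g., 4N channels + time embedding)
    nhead=4,               # assumed number of attention heads
    dim_feedforward=64,    # assumed feed-forward size
    activation="gelu",
    batch_first=True,
)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=3)  # three stacked blocks

x = torch.randn(8, 40, 32)   # (batch, window length, features)
y = encoder(x)               # (8, 40, 32)
```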
3. Results and Discussions
We use the balanced accuracy and F1 score performance metrics to compare the different methods. Balanced accuracy is the raw accuracy in which each sample is weighted according to the inverse prevalence of its true class. It helps us deal with imbalanced datasets by avoiding inflated performance estimates. If the classifier performs equally well on both classes, this metric reduces to the standard accuracy. In contrast, if the classical accuracy is above chance only because the classifier exploits an imbalanced test set, the balanced accuracy will, as appropriate, drop to chance level (50%):

$$\text{Balanced Accuracy} = \frac{\text{Sensitivity} + \text{Specificity}}{2}$$

where $\text{Sensitivity} = \frac{TP}{TP + FN}$, $\text{Specificity} = \frac{TN}{TN + FP}$, and $F1 = \frac{2TP}{2TP + FP + FN}$. The experiments were implemented using the PyTorch framework, and the labeled datasets were split into training, validation, and testing sets of 50%, 20%, and 30%, respectively. We ran the training twenty times to obtain average results, with a learning rate of 0.0001, a batch size of 64, and 1300 epochs.
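For reference, both metrics are available in scikit-learn, as this short illustrative example shows:

```python
from sklearn.metrics import balanced_accuracy_score, f1_score

y_true = [0, 0, 0, 1, 1]      # illustrative labels (1 = fall)
y_pred = [0, 0, 1, 1, 1]
print(balanced_accuracy_score(y_true, y_pred))  # mean of sensitivity and specificity
print(f1_score(y_true, y_pred))
```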
Furthermore, we conducted five different experiments, each designed with a specific goal, including comparison with existing fall-detection methods, finding the optimal sensor position for fall detection, and comparing different techniques to balance the fall datasets. The experiments and the results obtained are discussed below.
3.1. Comparing the Different Proposed Data Segmentation Methods
The purpose of this experiment is to evaluate the impact of sliding-window segmentation on deep-learning-based fall-detection approaches; to our knowledge, the impact of segmentation on fall datasets has not been evaluated before. An inappropriate segmentation method, such as the conventional sliding window, can decrease the classifier's overall performance. We found that multi-window segmentation increased the F1 score on our dataset. The length of the fall data differs between datasets, and the highest F1 score is obtained when the window size matches the length of the fall data. When the length of the fall data is known and uniform, selecting the window size is trivial; in the real world, however, fall data come in varying lengths. As our dataset has fall data of varied lengths, the results shown in Table 1 more accurately reflect the classifier's real-time performance. Multi-window segmentation achieves better results than the other two segmentation methods.
Although conventional sliding windows use overlapping windows, the window signal for a fall event does not always align perfectly with the fall patterns observed in the input signals. As a result, our classifiers would ordinarily produce false negatives and false positives. To solve this problem, we propose more robust data-segmentation methods. When windowing from peak signals, the window is aligned with the fall patterns of the input signals. Nevertheless, capturing the entire fall in one observation window remains difficult. Using multi-windows, we can segment the data according to the class of labels. Multi-windowing segmentation not only provides better results but also helps detect falls on a real-time basis (0.2 s). Because the data are segmented based on the label, we can make the input sequence (observation windows) small.
3.2. Experiment on the Proposed Techniques for Handling the Imbalanced Datasets
This experiment aims to compare the robustness of the proposed model on imbalanced datasets. We compare different methods for handling imbalanced data. The customized loss function and data augmentation improve the specificity and sensitivity of the model, resulting in balanced accuracies of 0.92 and 0.98, respectively. However, the resampling techniques reduce neither the false positives nor the false negatives. Multi-window-segmented data from the shin-mounted sensor are used to train these balancing techniques. Table 2 shows how all the methods performed on the evaluation set. In contrast to image-classification problems, our results indicate that downsampling and upsampling did not help with the sensor signals.
3.3. Comparison of Fall Detection with Different Sensor Positions
The purpose of this experiment is to demonstrate the capability of the proposed LSTM-based model for determining the optimal sensor position for fall detection, based on the observation windows generated with multi-window sliding. Therefore, we train the model using fall datasets (accelerometer and gyroscope) from all 16 sensor positions and evaluate it on the test set. The performance for the forearm, upper arm, hand, shank, hip, and foot is calculated as the average performance of the corresponding left and right sensor positions, and the thoracic result is the average of the upper and lower thoracic locations. We measure the model's balanced accuracy and F1 score to evaluate its performance. We present the results of the LSTM encoder-based model for the various sensor positions in Table 3. For the shank position, the model achieves a remarkable F1 score of 0.97. In general, the F1 score of the model is higher than 0.90, except for the foot and hand, where the F1 scores are 0.88 and 0.85, respectively. The results indicate that different sensor locations yield different performances and that the shank is the optimal location for the sensors.
3.4. Experiment on Fall Detection with Different Sensor Types
The data obtained from the three-axis accelerometer and gyroscope are combined to obtain information about the global acceleration and orientation of the body, which in turn are inputs to the algorithm. One objective of this study is to examine and compare the relative effectiveness of accelerometer-based algorithms and accelerometer-plus-gyroscope-based algorithms for fall detection. On the near-fall dataset, we find that the algorithm that uses gyroscope signals performs markedly better than the algorithm that does not. As shown in Table 4, using a gyroscope in conjunction with an accelerometer in the detection process can lead to incremental benefits compared to using an accelerometer alone.
Therefore, to detect falls with low energy consumption, we use acceleration along the x, y, and z axes and angular velocity along the x, y, and z axes. The evidence gathered in this study suggests that assessing wearable sensors located on the shins through acceleration and angular velocity features may represent an optimal combination to discriminate falls from non-falls (including near-falls).
3.5. Comparing Our Proposed Methods to Existing Methods
As a final experiment, we compare the performance of our proposed methods with the LSTM and CNN-LSTM of [16,23]. This experiment evaluates our competing deep learning methods, which use multi-windowing and competing methods of balancing the class samples. We implemented the existing methods and evaluated them on our dataset for comparison. Table 5 and the confusion matrix in Figure 4 compare the best results of the LSTM encoder and transformer encoder with those of [16,23]. Compared to [16,23], which achieved mean F1 scores of 0.89 and 0.86, respectively, the proposed LSTM encoder performs better, with a mean F1 score of 0.97 on the shank datasets. Moreover, the proposed transformer encoder method with multi-window segmentation identified falls against non-fall activities with an average F1 score of 0.96 on the shank datasets. These results show that our proposed approach improves performance by using multi-window segmentation and a customized loss function to handle the imbalanced dataset while using six features (three acceleration and three angular velocity features).
Lastly, we visualized the actual and predicted classes of the LSTM-based model for sample data points. As shown in Figure 5, the visualization shows the data points with their corresponding predicted and actual classes. In the third column of the first row, an ADL that is a near-fall activity is correctly classified as a non-fall activity, which shows that our model is stable even for near-fall events.
4. Conclusions and Future Works
In this study, we develop a deep learning method for wearable sensor-based fall detection in a real-time setting. First, we implement and evaluate the impact of different types of sliding-window segmentation on deep-learning-based fall-detection methods. To our knowledge, there are no studies on the evaluation and implementation of different sliding-window segmentation methods other than the conventional sliding window for wearable sensor-based fall detection. Using multi-window segmentation helps us to detect falls in real time (0.2 s). Using a gyroscope in conjunction with an accelerometer as part of the detection process can potentially lead to incremental benefits compared to merely using an accelerometer for the detection process. Another important finding is that the use of convolution-based learned features extracted with different kernels tends to increase the performance of the classifier. Moreover, from the performance presented above, it is better to use the customized loss function than resampling techniques and data augmentation for handling the imbalanced nature of the fall datasets. Balancing the sample dataset using upsampling of the minor class or downsampling of the major class did not help.
Although using wearable sensors has the advantage of flexibility and privacy over camera-based fall detection, camera video recording is still used for manual data annotation. Data labeling for supervised machine learning methods is expensive, as we use video recording and playback of each patient’s video to label the data. Future work in this study will employ self-supervised and unsupervised machine learning algorithms. We will create a large database by segmenting the input sequences from different positions with different sensors using multi-window or peak-detection windowing techniques and feed it to a variational autoencoder with LSTM and a transformer to perform unsupervised clustering.