1. Introduction
There is a growing trend in swimming, as in many other sports, to monitor human physiological function, technical skills, and performance during in situ training. Indeed, current swimming training programs make monitoring of the swimmer’s training load and performance a key concept [
1]. To do this, several performance devices and sensors are becoming more readily available for athletes and allow performance to be quantified more precisely [
2,
3]. Among conventional devices, global positioning systems cannot be used during indoor swimming while cameras suffer from optical occlusions and data processing that is non-automatic as well as time consuming, making reliable assessment hard to achieve. To overcome these limitations, inertial measurement units (IMU) have become a relevant solution for monitoring and performance analysis [
4]. Moreover, they do not require any external equipment, are not confined to restricted capture areas, and have the ability to continuously monitor swimmers in a real environment over a prolonged period without huge technical and logistical pressure [
5]. While IMUs are now widely used for human movement analysis in various fields, such as clinical, ergonomics, and sports [
2], swimming possesses features that make it distinctive in comparison to other sports. Indeed, swimming is characterized by the existence of four strokes that include large 3D sculling movements along with different intra-cycle phases (glide, catch, pull, recovery) and transitional phases (the start, turn, and underwater phases). Moreover, the design of a swimming training session is carried out on the basis of work and recovery times and exercise intensities that impact the biomechanics of swimming [
6,
7]. This observation provides meaning and is an opportunity to provide valuable indicators (e.g., lap and rest times, stroke count and frequency, time spent underwater) in the frame of a monitoring process. Another feature lies in the medium in which the swimmer moves, which may influence raw IMU data. Indeed, gyroscopes are highly sensitive to temperature variation [
8,
9,
10], which may engender significant drift, particularly for temperatures higher than 20 °C [
11]. Another consideration is due to the oscillations between body and device, i.e., soft tissue artefacts (STA) [
12], which compete with drift as the most crucial source of error. As STAs are both motor task- and subject-dependent [
13], it can be assumed that these oscillations in water differ from those recorded on land, despite fixation techniques that may limit the oscillations. These considerations (biomechanical features and specificity of the medium in which the swimmer moves) clearly illustrate the requirements for the development of a specific method dedicated to the analysis of swimming activities based on IMUs. Several valuable pieces of information can be extracted from IMU data during swimming phases, as well as during the underwater and turn phases [
4,
5,
14]: (1) temporal parameters, e.g., lap and rest times; (2) instantaneous kinematic parameters, e.g., intracyclic variations; and (3) stroke characteristics, e.g., stroke count and frequency and time spent underwater. In a monitoring context, the identification of at least all swimming phases (butterfly, backstroke, breaststroke, frontcrawl) and transitional phases (start, turn, and underwater phases) along with lap and rest times is required. This can be achieved through the development of a human activity recognition (HAR) methodology.
In this regard, various IMU-based classification algorithms have been proposed in swimming [
15,
16,
17,
18,
19,
20]. Most published swimming activity recognition methods select features from a set of engineered metrics by relying on conventional mathematical operations in the time and frequency domains [
21]. These methods include detection of extrema and/or zero-crossing and filtering techniques applied to the signals based on a set of predefined thresholds. Using an accelerometer located on the sacrum, Refs. [
17,
22] proposed a threshold-based methodology from a combination of orientation and component energy to determine the stroke type with 96.1% accuracy. This method, sometimes combined with a gyroscopic signal, has been widely used in the literature for stroke recognition [
23,
24,
25]; as well as for the wall push, turn, and touch identification sometimes used for lap time (LT) calculation [
19,
25,
26,
27,
28,
29,
30,
31]. LT calculation refers to identified peaks in the acceleration signal. However, this threshold-based technique can lead to mistakes, as false positive peaks such as those induced by a powerful leg kick may be confused with the beginning or end of a lap. Despite their ease of use, these methods suffer from several drawbacks: (1) it is necessary to find an optimal parameterisation for the processing functions in order to set the population under study; (2) it is relatively difficult to deal with the inter- and intravariability of IMU signals due to the differences in athletes’ technique; and (3) it has difficulty handling environments with interference from other swimmers. These factors compromise the ability of threshold-based techniques to provide automatic robust classification without generalisation issues.
Recently, thanks to improvements in computing capabilities, many studies have used machine learning (ML) approaches that can automatically classify sequences of features or directly learn features from the signal in order to increase the accuracy of general human activity recognition [
2]. More specifically, for swimming activity classification these approaches include Support Vector Machine (SVM) [
32], Random Forest (RF) [
33], and Principal Component Analysis (PCA) [
15] approaches. These methods have been mainly used for feature selection in order to train a classifier for swimming style classification. Such methods are often designed from large time windows for feature extraction, resulting in poor temporal precision that may not be suited for monitoring elite swimming. Thus, one recent paper compared sliding window techniques to an intra-stroke segmentation technique and pointed out the possibility of performing stroke-by-stroke analysis [
20]. Moreover, depending on the objectives with which the model is developed, the cross-validation technique has to be justified. When working with an identified group such as a swimming team, it can be of great interest to develop a specific model for the team in order to make the classification more informative and precise. In this case, either (1) a holdout or (2) k-fold cross validation can be sufficient [
34]. However, if the aim is to develop a model that can be generalized to wide population panels, it is essential to separate the training, validation, and testing sets with different subjects in order to assess the generalization power of the model using a subject-independent cross-validation technique. For HAR, two main methodologies of cross-validation are detailed in the literature: (1) leave-one-out cross-validation, which typically has a high computational cost; and (2) hold-one-out cross-validation, which is generally used when the algorithm requires considerable computation to iterate, as with deep learning (DL) models [
21].
The last approach takes advantages of recent enhancements in Deep Learning (DL). Indeed, such techniques have made it possible to achieve promising accuracy, reproducibility, and temporal precision in exhaustive human activity recognition. DL refers mainly to neural networks that exploit many layers of nonlinear transformation processing for non-human-dependant feature extraction and classification. They are organised hierarchically, with each layer processing the outputs of the previous layer. DL for time series classification relies on the ability to automate the critical feature extraction module via learning from signals, using, for example, layers in Convolutional Neural Networks (CNN) or in Long Short-Term Memory (LSTM). Previous studies using IMU data have reported that using DL is likely to surpass conventional ML algorithms in HAR [
35]. More specifically, in Swinmming Activity Recognition (SAR), CNN-based methods, which can automatically extract discriminative features with convolutional kernels, have demonstrated better and more generalizable performances than conventional methods using predesigned features such as descriptive statistics [
16]. An interesting DL approach for human swimming style recognition and lap counting has been developed in which a convolutional neural network (CNN) with high performance was successfully used in swimming style recognition. Furthermore, studies based on recurrent neural networks (RNN) such as LSTM have demonstrated interesting results for SAR [
36] with a Bi-LSTM (Bidirectional Long Short-Term Memory) network. Data records were collected from 40 swimmers and were labeled into eight classes: Unknown, Null, Freestyle, Breaststroke, Backstroke, Butterfly, Turn, and Kick. The Bi-LSTM method was able to perform activity classification with an average F1 score of 91.39%. To attain such performance, the network used pre-extracted statistical features as inputs instead of direct IMU signals. However, LSTM and Bi-LSTM are designed to process and make predictions from available sequences of data. In contrast, CNN is designed to exploit “spatial correlations” in data, making them perform well when identifying shapes from images [
37]. Therefore, it would be interesting to train a bi-LSTM-based model on swimming without pre-extracted features, as in [
36] except with raw IMU data, in order to input data to be temporal series. All of the above-mentioned studies relied on publicly available [
16] databases with a non-video-based labelling of activity into five classes: the four swimming techniques and a transition phase including the rest of the swimmer’s activity (turn and break). Despite valuable contributions, this approach requires the development of new classes to provide more widely applicable insights for monitoring in real training conditions. From this perspective, it is be necessary to integrate other essential variables of training control, such as LT and a classification separating the underwater, turn, and rest phases. To the best of our knowledge, ML and DL have never been used to compute LT automatically. Finally, while most of these models are well suited to considering homogeneous data, they show large performance drops in ecological conditions or when applied on high level populations, for example. Indeed, many papers developed their models with a homogeneous population [
25] and/or at homogeneous paces [
15]. Therefore, it may be important to gather a database in which the participants have heterogeneous levels of swimming (inducing inter-subject variability) and swim at heterogeneous paces during their training (inducing intra-subject variability) in order to ensure that the model can be widely generalized.
This study is primarily aimed at developing a deep learning model devoted to the analysis of swimming using a single IMU attached to the sacrum. It should be able to classify swimming activities at the different velocities that may occur during a full training session. Moreover, the proposed methodology should be generalizable to a wide panel of swimmers through the use of a database with high inter- and intra-swimmer variability. Secondary purposes are to compute LT directly from the classifier with high temporal precision and to validate LT values using a video gold standard.
3. Results
3.1. Performance of Swim Activity Recognition
The results presented here are based on the predictions made by the model for each iteration of the sliding window. Based on the video labeling, the model is supposed to predict the performed activity at the median frame of the sliding window, which in this work is the 45th frame of each window. They are presented in
Table 5.
Prior to the filtering step, the average overall precision on the testing set was 0.77 and the weighted average precision was 0.92. As the dataset was unbalanced, assigning weights to the classes as a function of their number of samples tended to improve the precision, as the model trained better on those classes. Short classes such as wallpush, turns, and underwater had lower precision, respectively, at 0.17, 0.58, and 0.71. However, activities that were more well represented in the dataset (strokes and rest) were predicted more precisely, as 0.98, 0.96, 0.83, 0.96, and 0.99, respectively, for the butterfly, backstroke, breaststroke, front crawl, and rest.
This tendency remains the same for recall, with lower values for short classes and higher values for stroke styles and rest. Indeed, recall was 0.08, 0.71, and 0.71 for the wallpush, turns, and, underwater phase, respectively, and 0.83, 0.94, 0.95, 0.96, and 0.99, respectively, for the butterfly, backstroke, breaststroke, front crawl, and rest.
The combination of those last two metrics, that is, the F1-score, was 0.91 on the testing set. F1 scores per phase were between 0.11 and 0.99 for wallpush and rest, respectively. For the four swimming techniques, F1 scores were between 0.89 and 0.96, with 0.90, 0.89, 0.95, and 0.96, respectively, for the butterfly, breaststroke, backstroke, and front crawl. For the non-swimming phases, the underwater and turn phases had F1 scores of 0.91 and 0.82, respectively.
The second step of the model, filtering, removed artefacts due to misclassifications . Whatever the class, performance metrics were systematically improved after filtering. The average overall precision on the testing set increased to 0.88 and, the weighted average precision increased to 0.96. Short classes such as wallpush, turns, and underwater had lower precision, at 0.53, 0.75, and 0.92, respectively. Strokes and rest reached 0.99, 0.98, 0.89, 0.99, and 0.99, respectively, for the butterfly, backstroke, breaststroke, front crawl, and rest.
This hierarchy remained the same for recall, with lower values for short classes and higher values for stroke styles and rest. Recall values were 0.19, 0.91, and 0.91 for the wallpush, turns, and underwater phase, respectively, and 0.88, 0.99, 0.98, 0.99, and 0.99 for the butterfly, backstroke, breaststroke, front crawl, and rest, respectively.
Finally, F1-score reached an overall value of 0.96. However, results showed heterogeneity in the performance repartition. Indeed, the wallpush F1-score was 0.28, while the rest F1-score was 0.99. For the four swimming techniques the F1-score was between 0.93 and 0.99, at 0.93, 0.94, 0.99, and 0.99, respectively, for the butterfly, breaststroke, backstroke, and front crawl. Regarding non-swimming phases, the underwater and turn phases reached an F1-score of 0.91 and 0.82, respectively.
3.2. Lap Time Assessment
The agreement between the lap times obtained from the IMU and the lap times obtained from the gold standard video are presented in
Table 6.
LT IMUs are slightly overestimated, with an error (systematic ± random) of 0.06 s (−0.05; 0.14) ± 0.60 s (0.57; 0.63) and MAPE of 1.77 ± 1.82%, corresponding to 0.42 s ± 0.43.
Figure 3 presents the Bland–Altman plot and density distribution of the differences between LT IMU and LT CAM, summarized in
Table 6. Reported errors are mainly around
s.
The agreement between the lap times obtained from the stopwatch and those obtained from gold standard video are presented in
Table 7.
LT MAN are slightly underestimated, with an error (systematic ± random) of −0.10 s (−0.14; 0.07) ± 0.58 s (0.55; 0.61) and MAPE of 1.43 ± 2.04%, corresponding to 0.34 s ± 0.48 s.
Figure 3 presents the Bland–Altman plot and density distribution of the differences between LT MAN and LT CAM, summarized in
Table 7. Reported errors are mainly around
s.
Lap Time Type Analysis
The statistical agreement between LT IMU and LT CAM according to the type of lap is presented in
Table 8, and Bland–Altman plots are shown in
Figure 4.
The greatest error (systematic ± random) was for LT END, with 0.25 s (0.13; 0.38) ± 0.98 s (0.90; 1.08), whereas the errors are lower and in the same range. Indeed, LT START and LT MIDDLE revealed errors of 0.01s (−0.04; 0.06) ± 0.39 s (0.36; 0.43) and 0.00 s (−0.03; 0.03) ± 0.37 (0.34; 0.39), respectively.
4. Discussion
This study aimed to develop a deep learning model devoted to analysis of swimming using a single IMU attached to the sacrum. In particular, the proposed methodology was set up for the purpose of classifying swimming activities at several swimming velocities that may occur during a full training session. A second purpose was to assess the performance of the model in automatically calculating lap times during the exercise.
Previous studies investigated lap detection and/or swimming technique identification using a single IMU sensor located on the sacrum [
27], head [
50], chest [
18], or wrist [
16], or in multiple sensor locations [
15]. In the present study, the sensor was placed on the sacrum, which is a convenient placement in terms of comfort, safety, and minimal obstruction of movement [
20]. Moreover, a recent study has shown that the highest performance in both lap detection and swimming technique identification were achieved with a sensor placed on the sacrum [
15].
Most previous papers collected data from relatively homogeneous groups of swimmers. Indeed, the data used for the classification algorithm settings independently considered elite level [
20], college level [
18], national second league level [
16], and national level [
15] swimmers. Thus, each of these studies were likely to rely on similar swimming techniques. This homogeneity in terms of the swimming level may affect the process of model training, as the resulting model may fail to generalize the algorithms to swimmers with different skills and levels. In the present study, a strong restriction on swimming level was imposed in the inclusion of participants. As the present database included elite and non-elite swimmers from the regional to the second league level to participation in the national championship, the dataset used here was representative of a large variety of stroke techniques and levels. This suggests that the proposed model can be used reliably with a wide range of swimming proficiency levels. Moreover, many previous studies involved small numbers of participants, i.e., N = 3 [
51], N = 11 [
52], N = 13 [
25], and N = 17 [
15]. Apart from [
16,
18], who collected data from a large number of swimmers, i.e., 45 and 40, respectively, to the best of out knowledge, the data collection presented in this manuscript represents one of the largest databases collected in the literature to date regarding IMU-based swimming activity classification. Moreover, with strong inter- and intra-subject variability, it can be hypothetised that the present database covers most of the swimming skills and techniques needed to train an in-field generalisable classifier. Furthermore, the literature highlights a clear imbalance in terms of the different swimming actions analyzed. Indeed, most previous studies were restricted to specific swimming techniques or phases. While most classification algorithms have focused on the front crawl and backstroke, studies investigating the butterfly and/or breaststroke are more scarce. Moreover, apart from [
15], the identification of swimming microphases, including turns and underwater phases, have not previously been tackled. However, the identification of those phases is decisive in analysis over a full training session, or at least during a swimming set. The deep learning model presented in this manuscript is able to distinguish eight classes, including wallpush, underwater, turn, rest, and the four swimming techniques. Consequently, beyond traditional metrics used to analyse swimming performances during training, this model should be able to monitor a swimmer’s performance on non-swimming phases (turn time, underwater time) as well, and could therefore allow new metrics to be derived, for example, underwater distance covered, which is nowadays an important part of the final performance. Another original aspect of the present manuscript lies in the ability to detect rest periods, which is a crucial components of training monitoring. Furthermore, most previously published studies have focused on homogeneous swimming intensities, introducing poor intra-subject variability. Despite high levels of accuracy obtained, this can raise the question of whether the model can be generalized to different swimming velocities. Indeed, Ref. [
15] raised the hypothesis that only machine learning methods may be efficient to deal with inter- and intra-swimmer variability in terms of technique. Finally, incorporation of all four swimming techniques combined with different intensities and variability in the level of expertise during the training stage is the starting point to produce a robust model, which is an important contribution of the present manuscript.
To the best of our knowledge, little attention has been paid to deep learning in the context of swimming analysis using IMU data. However, deep learning is a powerful solution to the development of models adapted to a wide panel of users with generalisation performances that can be controlled during model training. Such models are now used for high complexity classification problems such as computer vision. Therefore, such algorithms may be suitable for the complexity of HAR. Recurrent neural networks such as LSTM networks are considered one of the most efficient approaches in learning dynamics from time series [
53]. Moreover, this problem of classification involved sequences for which all time steps are available before performing the prediction. Consequently, it is possible for the network to learn dependencies in both directions of the signal. Indeed, bi-LSTM networks were first designed to learn dependencies on the input sequence as-is and on a reversed copy of the input sequence [
42]. Furthermore, the problem of classification faced here is a multivariate classification problems with non-linear temporal dynamics as input. For this reason, the model architecture used several bi-LSTM layers. LSTM layers have memory cells that act as an accumulator or a gated leaky neuron [
37]. Increasing the depth of the network allows recombining the learned temporal representations from previous layers to increase the level of abstraction with the new representations [
54]. This use of bi-LSTM for swimming classification with temporal series as inputs and eight label classes as target outputs is, to the best of the authors’ knowledge, the main contribution of this model to the state of the art. From a more technical perspective, a difficulty was in reaching a good balance between the number of parameters important enough to be used in modeling the whole classification problem without leading to any important variance during the use of unknown data (this variance would be a consequence of overfitting). In order to reach this objective, we tried to make the ratio between parameters and the number of samples input to the model ultimately tend towards 0. In order to fulfill this objective, an architecture with four successive layers of bi-LSTM was chosen, with a decrease in units followed by one dense layers. Moreover, important care was taken to not overfit the data during training. To achieve this, deep learning techniques such as dropout [
55] and recurrent dropout [
56] were applied in the bi-LSTM layers and dense layers during training. Furthermore, the performance between the training, validation, and testing sets was compared in order to control the tradeoff between bias and variance.
This methodology finally led us to predict swimmers’ activity for each frame recorded during their activity, i.e, every 0.02 s. Recent studies have developed novel approaches to increase the temporal precision of such predictions, such as the one in [
20]. More specifically, these authors used a synthetic minority oversampling technique (SMOTE) [
57] relevant for microphase analysis. The only paper considering numerous microphases is that of [
15], who performed SAR in the same eight classes we investigated, namely, wallpush, underwater, turn, rest, and the four main swimming techniques. However, with a single intensity at 80% of maximal speed, their database suffered from a lack of generalisation; this would require various intensities to be encountered during training. The present manuscript, in contrast, tried to attain high temporal precision with numerous types of activities predicted at very different speeds and intensities.
The literature regarding SAR is composed of three main methodologies: signal processing methods, classical machine learning methods, and deep learning methods. Signal processing methods are the most investigated methodology and have shown good results, with up to 100% stroke recognition and up to 99% the lap segmentation [
25]. Regarding machine learning methodologies, they have shown good results, with an accuracy up to 98.63 ± 1.9%, 99.04 ± 0.91%, 99.10 ± 1.43%, and 97.24 ± 1.71% for butterfly stroke, breaststroke, backstroke, and front crawl, respectively [
51]. Furthermore, Ref. [
58] showed that combining several consecutive predictions led to 100% good predictions. More recently, Ref. [
20] reached F1-scores greater than 0.99 with a stroke-by-stroke approach, realizing better temporal precision in the identification of stroke types. Finally, Ref. [
16] used a deep learning method to reach an average F1-score of 97.4. In the present manuscript, with F1 scores of 0.93, 0.99, 0.84, and 0.99 for the butterfly, backstroke, breaststroke, and front crawl, respectively, the results are in the same range as the results of most previous studies with respect to stroke recognition.
More generally, the choice presented in this paper, i.e., to make predictions using a sliding window overlapped frame per frame, showed excellent performance over most of the eight classes. Indeed, the weighted averaged overall F1-score after network training and filtering was 0.96, although the F1-score reached by each class showed a high variability between different classes’ precision and recall. Indeed, very short classes such as wallpush and turns were those with the poorest precision, with F1-scores of 0.28 and 0.82, respectively. This poor identification may be due to the duration of phases that are very short, leading to a lower representation in the database of those phases and consequently to fewer opportunities for the model to train on such temporal configurations. However, identifying a wall push is not systematically necessary to identify lap times, because the identification of pre- and post-phases may be sufficient. Moreover, the analysis of the classification report showed that wallpush is most of the time confused with either the underwater, turn, or rest phases, which are the adjacent classes. This analysis suggests that there is not a misclassification of wallpush with other classes; rather, the model has difficulty identifying boundaries between turns, underwater and rest phases, and wall pushes. Moreover, most wall pushes appeared to last less than a few tenths of second. Therefore, using the adjacent classes does not alter the precision of our results. Another tendency that can be pointed out is that of the underwater phase to be misclassified with butterfly and breaststrokes in certain cases. This intra-phase confusion can be explained by the similarity between the sacral kinematics of underwater phases and simultaneous stroke styles that include undulatory movements. However, for classical phase identification the performance is excellent, as previously mentioned. Thus, the model presented in this manuscript may be promising for a wide range of applications. Several key variables in the monitoring of swimming training can be derived from the prediction of swimmers’ activity. Indeed, the time spent underwater and turning time are variables of transition phases that, while poorly investigated by coaches, represent up to one third of the final performance [
59]. In addition to the classical variables (mean swimming speed per lap, lap times, etc.), the use of this model may be of great value for elite swimmers and coaches.
However, in order to make further comparisons with the existing literature, it has to be noted that whereas most of the previous studies performed prediction for a session or window with a size ranging from tenths to tens of a second, leading to macro-prediction of swimmers’ activity, this model performs a prediction at every time step, i.e., every 0.02 s. Consequently, it is difficult to compare the performance of this model with the existing literature using predictions with a significant difference in temporal precision. Moreover, several authors adjusted their algorithm to the homogeneity of their population [
25], which is convenient when working with a small and identical group but not generalizable when working with wide panels of unknown swimmers. Therefore, another contribution of the present manuscript is the development of a model suited for any level of swimmer and intensity of swimming, regarding the variability embodied in the database. Such precision has, to the best of our knowledge, never been reached in the previous literature aiming to classify swimming activity with machine learning models.
Our results on lap times highlights a minimal loss compared to stopwatch measurement (3.84% vs. 12.12%), demonstrating the relevance for coaches of such an automated method. This may help coaches to automatically monitor many swimmers at the same time. In comparison with previous works based on thresholds, the present results based on a large data base show lower differences between IMU-based and video-based lap times. Indeed, a difference of 0.06 ± 0.6 s was obtained with n = 952 samples, whereas previous works reported differences of 0.72 ± 0.26 s with n = 132 [
25] and −0.32 ± 0.58 s with n = 164; Ref. [
22] reported a typical error of measurement (TEM) of 0.6 s between their model and the stopwatch. In this study, the TEM with the stopwatch was 0.58 s. This systematic error is nearly identical between our model and an expert coach taking lap times with a stopwatch, reinforcing the confidence a trainer can have in such an approach. To fully investigate this criteria, an excellent statistical indicator seems to be the MAPE. In this paper, the MAPE was 1.15% for LT START, 1% for LT MIDDLE, and 4.07% for LT END. For comparison, Ref. [
60] found an MAPE of 3.22% for the TritonWear device (TritonWear, Toronto, ON, Canada) over a 144-bout medley, without distinguishing between bout localization. This paper is therefore a strong contribution towards accurate lap time prediction, which can help in gathering a wide panel of swimming performance data. Gathering data with such precision would be of great interest for coaches, allowing them to better monitor swimmers’ performance in a longitudinal way, as well as for scientists, whose objective would be to model swimming performance across seasons. Moreover, to the best of our knowledge, the separation between LT START, LT MIDDLE, and LT END has never been investigated in the literature with such precisions. However, further improvement in the accuracy of this measurement are very possible. Indeed, whereas good homogeneity between LT START and LT MIDDLE can be highlighted, LT END shows the largest MAPE. This discrepancy with other LT elements can be explained by the transition between a swimming phase and a rest phase, which sometimes may not be easily identified when the swimmer does not actually touch the wall at the end of a training session. This finding is in agreement with other studies that report difficulties in identifying the final touch [
22,
25,
30].