1. Introduction
Quantitative analysis of human movement has long been of interest within sports biomechanics for its ability to determine performance and strategy [1], as well as for its application in rehabilitation to identify injury risk factors [2] and facilitate recovery [3]. The demand for motion analysis to capture more complex environments in sport is pushing the development of faster, more autonomous, and more sophisticated techniques. Biomechanical analysis in applications such as training and competition must meet the following criteria: provide accurate kinematic data, deliver information in a timely manner, and remove factors which restrict or influence the subject’s natural movement [4].
The most widespread techniques for kinematic data capture have historically been manual notation on prerecorded videos and marker-based technologies. However, they are not without their drawbacks. Manual notation involves replaying game film and manually localizing the joints of interest in each sequential frame and, if required, from each camera perspective [5]. In the implementation by Sanderson and Way, notational analysis was used to describe sequential strokes in squash using symbols and court plans for positional information [6]. This method is affordable and does not require attachment of markers but remains a time-consuming and laborious task prone to subjective error. Marker-based systems use multiple cameras and markers placed on specific joints of the subject to locate the position of the body. Many commercially available systems use automatic optoelectronics, which require subjects to wear passive reflective markers, usually in the form of a suit, as reported by Richards, who reviewed passive marker systems including the Ariel system, Motion Analysis’ HiRes system, Peak Performance’s Motus system, Qualisys’ ProReflex system, BTS’s ElitePlus system, and Vicon’s 370 system [7]. The cameras emit invisible infrared light, which the passive markers reflect back to the cameras. The cluster of markers improves time efficiency as it allows for quick location of the subject. Despite the decreased processing time, limitations remain due to the markers, including long participant preparation time and inevitable variability in placement [8]. There is also a higher possibility of rigid body assumptions being violated [9]. Additionally, markers cannot be used in most competition settings due to the physical and/or psychological effects of having extra attachments. Further, because these methods require specialized equipment, data collection is restricted to local participants, limiting the ability to study elite level athletes from across the world. These limitations have motivated the development of motion analysis systems towards an autonomous, markerless approach using deep learning and computer vision. Such systems have mostly been applied to slow movements such as walking or jogging and have remained in laboratory settings, where studies evaluate the accuracy of their systems with the use of multiple cameras [10,11,12,13].
More recently, computer vision has been applied to player tracking for indoor sports. Perš and Kovačič [14] focused on tracking handball players using two cameras and applying frame subtraction based on motion detection, template tracking, and color-based tracking. Another system applied to handball was discussed by Santiago et al. [15], who proposed a player tracking method based on image and color processing. Both methods require specific equipment and a special camera setup, and present tradeoffs between manual intervention and analysis time. Specifically for squash, a computer vision driven tracking system was developed using the HOG (Histogram of Oriented Gradients) algorithm implemented within the OpenCV library for player detection [16]. This method detects the player as a whole rather than tracking the feet specifically, causing valuable kinetic information to be lost. In addition, it requires a special camera setup with prior court calibration, and it is noted that the algorithm suffers from poor response time.
Few papers study player kinematics and kinetics of squash. In past studies, Hughes and Franks [17] investigated the correlation of velocity and acceleration in the last ten seconds of a rally for winners and losers. To collect positional data of players, observers manually annotated video images of squash matches using a digitizing pad and stylus [18,19]. Vučković et al. [20] built on previous work to analyze entire match play as well as play at the rally level, studying the distinction between winners and losers of the rally in terms of court position, total distance, average velocity, and acceleration. In a subsequent study, use of the ‘T’ area was correlated with player ability [21]. These studies utilized the SAGIT/Squash tracking system, a real-time data acquisition tracking system which requires color images from a bird’s-eye camera positioned above the court and compares them with an empty court image to determine player position [22,23]. Correlations have further been studied between the game outcome and rank of elite squash players by quantifying distance traveled, position relative to the T, dominance of the T, average velocities, and frequency distribution of velocities of different ranking players [24]. Buote et al. [24] concluded that total distance and average velocity of the player are not suggestive of the rank of the player or the outcome of the game; however, the player’s rank can indicate their ability to dominate the T and control the court. The data were collected using broadcast videos provided by the Professional Squash Association (PSA), analyzing only active match play of the full court view provided by the main camera. Using video analysis software, Dartfish Team Pro version 8 (2015), markers were manually placed on each foot for every eligible frame to determine player position in the video coordinate system. Ten reference points on the court were recorded to determine a coordinate transformation converting the video image coordinate system to the coordinate system of the plane of the court [24].
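The conversion from image coordinates to the court plane can be illustrated with a short sketch. It is not the transformation used by Buote et al. [24], whose equations are not given here; it assumes a projective (homography) mapping fitted exactly to four hypothetical corner points, whereas ten reference points would call for a least-squares fit. The pixel values, the court frame (origin at the front-left corner, 6.4 m by 9.75 m floor), and all function names are assumptions for illustration only.

```python
def solve_linear(a, b):
    """Solve the square system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [list(row) + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def fit_homography(image_pts, court_pts):
    """Return h0..h7 of a projective map (last entry fixed to 1) from four point pairs."""
    rows, rhs = [], []
    for (u, v), (x, y) in zip(image_pts, court_pts):
        rows.append([u, v, 1.0, 0.0, 0.0, 0.0, -u * x, -v * x])
        rhs.append(x)
        rows.append([0.0, 0.0, 0.0, u, v, 1.0, -u * y, -v * y])
        rhs.append(y)
    return solve_linear(rows, rhs)


def image_to_court(h, u, v):
    """Map one pixel (u, v) to court coordinates (x, y) in meters."""
    w = h[6] * u + h[7] * v + 1.0
    return ((h[0] * u + h[1] * v + h[2]) / w, (h[3] * u + h[4] * v + h[5]) / w)


# Hypothetical pixel positions of the court corners (front-left, front-right,
# back-left, back-right) and their assumed court coordinates in meters.
IMAGE_CORNERS = [(400.0, 200.0), (880.0, 200.0), (200.0, 650.0), (1080.0, 650.0)]
COURT_CORNERS = [(0.0, 0.0), (6.4, 0.0), (0.0, 9.75), (6.4, 9.75)]
H = fit_homography(IMAGE_CORNERS, COURT_CORNERS)
```

With the fitted coefficients, `image_to_court(H, u, v)` converts a detected foot position in pixels to a position on the court floor. A flat-floor mapping of this kind treats every detected point as lying on the court plane.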
The contribution of the current paper is to advance the development of accurate and reliable markerless motion tracking for squash by removing the need for a special setup, reducing processing times, and limiting the user intervention required by previous squash studies. The proposed methodology improves on the previous work done by our research group [24] by replacing the time-consuming and laborious task of manual player tracking with autonomous, deep learning based human pose estimation to detect individuals in the frame and computer vision to identify the players. Removing the need for specific equipment and limiting user intervention increases the number of eligible matches we can analyze in a timely manner. Matches that are filmed by the PSA or filmed similarly can be analyzed by our methodology. This study outlines our proposed method and validates it against the results of the previous study completed by Buote et al. [24], which quantified the players’ distance traveled, position relative to the T, and average velocities. This is the first study to apply deep learning and computer vision motion tracking techniques to study elite squash players in competition.
3. Results
The match spanned five games and 41.4 min with 22.4 min (55.3%) of active match play. An average of 76.5% of active match play was analyzed after removing frames from camera angles other than the court main camera view. A summary of game length, % of active game play, and % analyzed is reported in Table 2.
The match was recorded at 25 fps, and manual tracking included frames during active game play taken from the court main camera view [24]. Further analysis by Buote et al. [24] was done only between consecutive frames and did not interpolate across breaks longer than 1/25 s. To validate the proposed method, player detection and identification were done on the same frames. Frames were discarded by the proposed method if a player was not identified, which was usually caused by player occlusion or an unnatural pose (e.g., Figure 2).
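The rule of analyzing only consecutive frames can be sketched as follows. This is a minimal illustration under assumed conventions, not the code used in either study: each sample is a hypothetical tuple of frame index and court position in meters, and a displacement is counted only when two samples are exactly one frame (1/25 s at 25 fps) apart.

```python
import math


def distance_between_consecutive_frames(samples):
    """Sum displacements over adjacent frames only.

    samples: list of (frame_index, x, y) tuples sorted by frame_index,
    with x and y in meters (assumed layout).
    """
    total = 0.0
    for (f0, x0, y0), (f1, x1, y1) in zip(samples, samples[1:]):
        if f1 - f0 == 1:  # skip gaps left by discarded frames
            total += math.hypot(x1 - x0, y1 - y0)
    return total


# Frames 2 to 4 are missing, so the jump from frame 1 to frame 5 is ignored.
example = [(0, 0.0, 0.0), (1, 3.0, 4.0), (5, 100.0, 100.0), (6, 100.0, 101.0)]
total_distance = distance_between_consecutive_frames(example)
```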
Table 3 presents the number and percentage of frames retained by our methodology compared to the manual tracking.
Figure 4 compares the unfiltered and filtered (green points) (a) x and (b) y coordinates over time collected by our proposed tracking method and the manual tracking method over 2000 frames. The red circles in Figure 4 highlight areas in which the filtered coordinates, smoothed by a 5th order moving average filter, improve on the tracking of the unfiltered coordinates.
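A moving average filter of the kind described can be sketched in a few lines. The exact filter design is not specified here, so the sketch assumes that "5th order" means a centered window of five samples, with the window shrinking at the ends of the series; the function name and edge handling are illustrative choices.

```python
def moving_average(values, window=5):
    """Centered moving average; the window is truncated at the series ends."""
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        lo = max(0, i - half)
        hi = min(len(values), i + half + 1)
        smoothed.append(sum(values[lo:hi]) / (hi - lo))
    return smoothed


# A single-frame spike of 10 in an otherwise flat coordinate is spread out.
spike = [0.0, 0.0, 10.0, 0.0, 0.0]
smoothed_spike = moving_average(spike)
```

Applied separately to the x and y series, a filter like this damps frame-to-frame jitter in the detected foot position at the cost of slightly blurring genuine rapid changes of direction.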
Unfiltered and filtered coordinates are plotted against the manual tracking coordinates in Figure 5 with the coefficient of determination. Table 4 outlines the R² values per game for both players.
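For reference, one common way to obtain a coefficient of determination for such a scatter plot is the squared Pearson correlation between the two coordinate series. Which definition was used here is not stated, so the sketch below is an assumption, and the function and variable names are hypothetical.

```python
def r_squared(reference, estimate):
    """Squared Pearson correlation between two equally long series."""
    n = len(reference)
    mean_ref = sum(reference) / n
    mean_est = sum(estimate) / n
    sxy = sum((r - mean_ref) * (e - mean_est) for r, e in zip(reference, estimate))
    sxx = sum((r - mean_ref) ** 2 for r in reference)
    syy = sum((e - mean_est) ** 2 for e in estimate)
    return (sxy * sxy) / (sxx * syy)


manual_x = [1.0, 2.0, 3.0, 4.0]      # hypothetical manual tracking values
automatic_x = [1.0, 3.0, 2.0, 4.0]   # hypothetical automatic tracking values
agreement = r_squared(manual_x, automatic_x)
```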
Table 5 compares both players’ positional statistics obtained from unfiltered and filtered coordinates with the results of Buote et al. [24] for the same match. The differences between these parameters and the results of Buote et al. [24] are quantified in Table 6.
Velocity statistics, including the average speed of both players calculated from the unfiltered and filtered coordinates and by Buote et al. [24], are presented in Table 7. Similar to the positional data, average speeds were compared to Buote et al. [24] and differences were quantified in Table 8.
The average differences and percent errors of the player data collected using the filtered coordinates are summarized in Table 9. Given that the error in position estimation is recommended to be less than the natural sway of the center of gravity of the human body (between 15 and 20 cm) in an observed movement, the average difference for positional data of 17.6 cm (Table 9) is acceptable but can be improved [30].
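An average positional difference of this kind could be computed as the mean Euclidean distance between matched positions from the two methods. The sketch below assumes that definition, which is not spelled out here, and uses hypothetical court positions in meters.

```python
import math


def mean_position_difference(reference_pts, estimated_pts):
    """Mean Euclidean distance (meters) between matched (x, y) positions."""
    distances = [
        math.hypot(ex - rx, ey - ry)
        for (rx, ry), (ex, ey) in zip(reference_pts, estimated_pts)
    ]
    return sum(distances) / len(distances)


manual = [(0.0, 0.0), (1.0, 1.0)]      # hypothetical manual positions
automatic = [(0.3, 0.4), (1.0, 1.0)]   # hypothetical automatic positions
avg_difference = mean_position_difference(manual, automatic)  # 0.25 m here
```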
4. Discussion
This study aims to apply deep learning and computer vision processes to evaluate kinematics of elite squash players for the first time. The method was validated by comparison with previous results from a manual tracking study [24]. Our method offers several advantages over prior data acquisition methods. The ability to analyze any match filmed by the PSA, or any suitable match filmed from a similar angle, without a special camera setup or wearable markers that could impede player movement, significantly increases the number of elite matches eligible for analysis.
A notable advancement in the present study is the speed of player tracking, which has been considerably accelerated to 0.3 s per frame. Player statistics are rapidly generated using Python code for easy computation. Thus, an ideal full match analysis takes approximately 3 h including tracking and analysis, where the majority of the process is autonomous. Presently, manual intervention is only required during pre-processing to identify active play. Broadcasts of professional squash matches do not have a definitive visual or auditory indicator of when a rally begins or ends, unlike other racquet sports. Based on our preliminary investigation, some strategies that could be implemented in the future to address this include tracking when the scoreboard changes, noting a change of the camera angle or pan away from the court (note that these implementations will not be completely instantaneous).
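The quoted processing time can be checked with simple arithmetic. The sketch below assumes that every frame of active play is tracked at 0.3 s per frame and ignores the time for pre-processing and statistics; the function name is hypothetical. For the 22.4 min of active play at 25 fps reported for this match, it gives 33,600 frames and 2.8 h of tracking, in line with the approximately 3 h stated above.

```python
def tracking_hours(active_minutes, fps=25.0, seconds_per_frame=0.3):
    """Estimated tracking time in hours for a given amount of active play."""
    frames = active_minutes * 60.0 * fps
    return frames * seconds_per_frame / 3600.0


match_tracking_hours = tracking_hours(22.4)  # 22.4 min of active play
```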
For the match analyzed, active play was slightly higher than half the total time of each game (55.9% on average). This supports the interpretation of squash as a sport that demands short, high intensity bursts rather than endurance and constant intensity [31]. Other camera angles such as the sidewall and close up secondary cameras do not display both players and are typically used for repetitive shots, usually drop shots or backhands down the wall from the left back corner. However, the movement of the players during these shots was cyclical between the T and the corner and deemed to be relatively equal, providing valid results for comparison and aligning with previous studies [17,20,21,24]. Work is currently under way to implement autonomous collection of other frame angles of the match. Future work could establish court conversion matrices using the inverse perspective mapping method on the main camera angle and other camera angles to analyze the full length of active match play [32,33].
The frames analyzed by the manual tracking method were used as input to the proposed tracking method, which detected and identified both players in an average of 83.82% of the input frames per game. Frames in which the system was unable to detect the players were due to player occlusion or an unnatural pose, as mentioned previously in the methods section. A global timestamp was assigned to each frame to account for the elapsed time when calculating velocity across missing frames. The agreement of time-dependent results such as player velocities with the manual tracking results supports the effectiveness of this approach.
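The role of the timestamp can be illustrated with a small sketch. It assumes each retained sample carries a time in seconds and a court position in meters (a hypothetical layout), and divides each displacement by the actual elapsed time, so that a gap of several discarded frames does not produce an inflated speed.

```python
import math


def speeds_from_timestamps(samples):
    """Speed (m/s) between successive retained samples.

    samples: list of (t_seconds, x, y) tuples sorted by time (assumed layout).
    """
    speeds = []
    for (t0, x0, y0), (t1, x1, y1) in zip(samples, samples[1:]):
        dt = t1 - t0
        if dt > 0:
            speeds.append(math.hypot(x1 - x0, y1 - y0) / dt)
    return speeds


# The second interval spans three frame periods (0.12 s) because two frames
# were discarded; dividing by the true elapsed time still gives 2 m/s.
track = [(0.00, 0.0, 0.0), (0.04, 0.08, 0.0), (0.16, 0.08, 0.24)]
track_speeds = speeds_from_timestamps(track)
```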
Court conversions were determined using reference points in the frame specific to the court and camera angle. The equations have been noted to be more accurate in predicting a player’s position near the T than the top corners, likely due to the distribution of reference points having a higher concentration in the center (service lines) compared to the corners [24]. The raw position coordinates were smoothed using a 5th order moving average filter because of the variation in foot detection (described in the methods section). The R² values were slightly higher for the unfiltered x coordinates than for the filtered x coordinates (0.990 and 0.988, respectively) (Table 4). Conversely, the R² values for the filtered y coordinates were slightly higher than those for the unfiltered y coordinates (0.971 and 0.966, respectively) (Table 4). In both cases the y coordinates agree less closely with manual tracking than the x coordinates, indicating that the accuracy of the system is limited by the margin of error of the y coordinates. This is due to the dimensions of a squash court, where the length is longer than the width. Because of the camera angle perspective, the video image produces a court that is compressed lengthwise and is wider at the bottom (back wall) compared to the top (front wall), causing y coordinates to have a larger margin for error during detection.
The filtered coordinates were more reliable for cumulative statistics such as total distance, with an average percent error of 3.73% compared to 19.85% for the unfiltered coordinates (Table 6). The variation in foot detection with the proposed method resulted in larger changes in coordinate position between frames compared to the manual tracking method, which resulted in consistently higher values for total distance traveled. Filtering was able to remove the problematic fluctuations, resulting in total distance traveled values that were closer to the manually measured values. This is especially evident in Game 3, where both players have the lowest total distance percent error of 0.43% (El Shorbagy) and 6.98% (Mustonen) (Table 6) when compared to the rest of the games. Game 3 also has the lowest percentage of frames retained (76.50%, compared with an average of 85.65% for the other games), supporting the need to filter the raw coordinates. As in previous studies, players appear to travel similar distances to their opponent in each individual game, and distances traveled can be correlated to the length of the game [20].
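The way detection jitter inflates a cumulative statistic can be seen in a short sketch. The path data are invented for illustration, and the percent error is assumed to be the absolute difference relative to the manually tracked value.

```python
import math


def path_length(points):
    """Total length (meters) of a path through successive (x, y) points."""
    return sum(
        math.hypot(x1 - x0, y1 - y0)
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )


def percent_error(estimate, reference):
    """Absolute error of an estimate relative to a reference value, in percent."""
    return abs(estimate - reference) / abs(reference) * 100.0


# Hypothetical straight 4 m run, and the same run with 0.3 m of lateral jitter
# alternating from frame to frame.
smooth_run = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0), (4.0, 0.0)]
jittery_run = [(0.0, 0.0), (1.0, 0.3), (2.0, 0.0), (3.0, 0.3), (4.0, 0.0)]
inflation = percent_error(path_length(jittery_run), path_length(smooth_run))
```

Because each jittered step is longer than the true step, the error accumulates in one direction, which is consistent with the unfiltered totals being consistently higher than the manual values.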
Vučković et al. [21] suggested that the dominance of a rally can be indicated by the time spent near the T, except for closely contested games. This is in agreement with our results, as the winner and higher ranked player of the match, El Shorbagy (1.49 m for unfiltered coordinates, 1.57 m for filtered coordinates, and 1.71 m according to Buote et al. [24]), maintained a smaller average radius to the T than Mustonen (1.71 m for unfiltered coordinates, 1.80 m for filtered coordinates, and 1.93 m according to Buote et al. [24]). This is reflective of common squash tactics where skilled players play accurate shots to force their opponent to leave the T area, while less skilled players play a greater number of shots closer to the center of the court [21,24].
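The average radius to the T is the mean distance between the player's position and the T over the analyzed frames. The sketch below assumes a court frame with the origin at the front-left corner, x across the 6.4 m width and y measured from the front wall, and places the T at approximately (3.2, 5.44) m; both the frame and the T position are assumptions for illustration.

```python
import math

# Approximate T position in the assumed court frame (meters).
T_POSITION = (3.2, 5.44)


def mean_radius_to_t(points, t=T_POSITION):
    """Mean distance (meters) from a list of (x, y) positions to the T."""
    return sum(math.hypot(x - t[0], y - t[1]) for x, y in points) / len(points)


# Hypothetical positions: one on the T and one 2 m behind it.
positions = [(3.2, 5.44), (3.2, 7.44)]
radius = mean_radius_to_t(positions)
```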
Players spent an average of 53.7% (unfiltered and filtered coordinates) of the time on the left side of the T, which concurs with the finding of 56.5% from Buote et al. [24]. Since the left side wall camera view was not analyzed, the true percentages are expected to be higher. This aligns with Vučković et al. [34], who recorded an average of 64.6% of shots coming from the left side of the court for 10 matches played at the men’s World Team Championship in 2003. As both players were right-handed, a higher percentage of time spent on the left (backhand) side was expected since, at the elite level, a common tactic is to play to the opponent’s backhand, which is considered weaker and more difficult [24]. An overwhelming 86.4% (unfiltered and filtered coordinates) of the time was spent behind the T, agreeing with the manual tracking average of 89.7% from Buote et al. [24]. This is similar to the findings of Vučković et al., who recorded 74.5% of shots coming from behind the T at the same 10 matches at the men’s World Team Championship in 2003 [34]. The tendency to favor the left and to be positioned behind the T typically occurs when a player returns to center to anticipate the next shot. The lower percentages calculated using the proposed method compared to the reference are likely because most frames missing due to player occlusion occur near the T during the players’ return to the ideal position.
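Shares of time on the left side of and behind the T can be approximated by counting frames, assuming frames are evenly spaced in time. The sketch uses the same assumed court frame as an illustration (x increasing from the left side wall, y increasing from the front wall toward the back wall, T at roughly (3.2, 5.44) m); these conventions and the function name are hypothetical.

```python
def court_occupancy(points, t=(3.2, 5.44)):
    """Percent of positions left of the T and percent behind the T."""
    n = len(points)
    left = sum(1 for x, _ in points if x < t[0])      # closer to the left wall
    behind = sum(1 for _, y in points if y > t[1])    # closer to the back wall
    return 100.0 * left / n, 100.0 * behind / n


# Four hypothetical frames: three on the left side, three behind the T.
frames = [(2.0, 7.0), (2.5, 8.0), (1.0, 4.0), (5.0, 9.0)]
pct_left, pct_behind = court_occupancy(frames)
```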
The average speeds calculated from the filtered coordinates (overall 1.90 m/s) are much closer to Buote et al.’s results (1.85 m/s) [24] than those from the unfiltered coordinates (2.23 m/s). This supports the need for filtering of coordinates and is once again likely due to the variation in foot detection, causing increased distance traveled and in turn higher reported speeds between consecutive frames. The results of the filtered coordinates align with previous studies: Buote et al. [24] recorded a maximum average speed of 2.04 m/s over 5 matches of elite players from 2012–2014 and Hughes and Franks [17] recorded a maximum mean speed of 1.98 m/s, while the maximum average speed using filtered coordinates was 1.99 m/s. As the average walking speed is around 1.4 m/s and the walk-to-run transition has been noted to occur below 2 m/s, our overall average speed of 1.90 m/s reflects the idea that squash comprises shifts between walking and running [35,36,37,38].
Removing speeds below 1 m/s is argued by Buote et al. [24] to provide a more realistic idea of how fast players move to return shots. Speeds that fall under 1 m/s are primarily identified when a player is at center court waiting for their opponent to play a shot, during the pause for accuracy and power before a player makes their shot, and when players change directions. With this selection, our results show that for 70.2% of the time during active match play, players moved at an average speed of 2.44 m/s and only spent 29.8% of the time moving at less than 1 m/s. This is consistent with Buote et al.’s analysis of 5 matches [24] mentioned above, which found a mean player speed of 2.52 m/s (excluding speeds less than 1 m/s) for 69.6% of the time during active match play. These speeds represent the incredible level of conditioning and endurance elite squash players must possess to compete.
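The thresholded speed statistics can be sketched as follows, assuming one speed value per frame interval so that the share of intervals approximates the share of time. The threshold of 1 m/s comes from the text; the function name and sample values are hypothetical.

```python
def moving_speed_summary(speeds, threshold=1.0):
    """Percent of samples at or above the threshold and their mean speed (m/s)."""
    moving = [s for s in speeds if s >= threshold]
    share = 100.0 * len(moving) / len(speeds)
    mean_moving = sum(moving) / len(moving) if moving else 0.0
    return share, mean_moving


sample_speeds = [0.2, 0.5, 2.0, 3.0]  # hypothetical per-interval speeds
share_moving, mean_moving_speed = moving_speed_summary(sample_speeds)
```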
A limitation of this study is the inability to analyze the entirety of active match play (83.82% of the input frames analyzed on average, Table 3). Another constraint is the assumption that players slide horizontally across the plane of the court when converting video coordinates into court coordinates, meaning that vertical movement of a player due to jumping is counted as distance traveled. In addition, the conversion does not take into account any lens distortion. Our future work will focus on continuing to develop the reliability of this method, adding the analysis of additional camera angles, refining the model to reduce or handle missing frames, and gathering data on recent PSA matches. Further research opportunities include analysis of upper body and arm kinematics.