Article

Optimizing Football Formation Analysis via LSTM-Based Event Detection

Electrical and Computer Engineering Department, Brigham Young University, Provo, UT 84602, USA
* Author to whom correspondence should be addressed.
Electronics 2024, 13(20), 4105; https://doi.org/10.3390/electronics13204105
Submission received: 12 September 2024 / Revised: 2 October 2024 / Accepted: 16 October 2024 / Published: 18 October 2024
(This article belongs to the Special Issue Deep Learning for Computer Vision Application)

Abstract

The process of manually annotating sports footage is a demanding one. In American football alone, coaches spend thousands of hours reviewing and analyzing videos each season. We aim to automate this process by developing a system that generates comprehensive statistical reports from full-length football game videos. Having previously demonstrated the proof of concept for our system, here we present optimizations to our preprocessing techniques along with an inventive method for multi-person event detection in sports videos. Employing a long short-term memory (LSTM)-based architecture to detect the snap in American football, we achieve an outstanding LSI (Levenshtein similarity index) of 0.9445, indicating an average normalized difference of less than 0.06 between predictions and ground truth labels. We also illustrate the utility of snap detection as a means of identifying the moment when the offensive players assume formation. Our results demonstrate not only the success of our unique approach and underlying optimizations but also the potential for continued robustness as we pursue the development of our remaining system components.

Graphical Abstract

1. Introduction

Requiring thousands of man-hours annually, the manual analysis of sports footage is a laborious process. Coaches and analysts must dedicate considerable effort to thoroughly reviewing and annotating play footage to successfully plan and execute strategies. In American football, this process is not only indispensable but particularly high-stakes, as teams that strategically play their way to victory often reap substantial rewards, both financially and otherwise.
The current game footage analysis process generally requires numerous steps: (1) manually splitting videos into short clips of plays, (2) scrubbing through each clip to identify the offensive team’s formation, (3) watching the clip multiple times to observe the outcome of the play, and (4) recording any important information by hand or using software tools. With 272 regular-season NFL games, each featuring an average of 153 plays [1], this process is repeated tens of thousands of times annually at the professional level alone, not to mention the additional time required for analysis at the minor league, college, or high school levels.
Given the grueling nature of manual football footage analysis and the high stakes associated with its accuracy, automating this process would undoubtedly be a major benefit to football coaches and analysts, ultimately providing sizeable advantages to players and the industry as a whole. However, such automation does not come without challenges; even tasks that are easy for humans are hardly as simple for even the most sophisticated algorithms.
In particular, a human spectator, from almost anywhere in a football stadium, can recognize football players and their positions, even if there are visual occlusions such as other players. On the other hand, algorithms are less capable of this level of inference; their ability to recognize players and perform other tasks hinges on the camera’s pose and the resulting footage’s overall quality—two aspects that are rarely handled with consistency.
For instance, camera positions and heights vary across different leagues and stadiums, with camera operators employing inconsistent filming techniques—some videos feature extensive panning and zooming, while others do not. Even with more advanced technologies that enable a camera system to track on-field motion and play automatically, the recorded video is often not ideal. These factors heavily influence the visibility and occlusion of players. Additionally, natural variations in weather, field lighting, and image quality further affect video appearance. Players who are occluded or filmed under poor conditions are less likely to be recognized at the same level of confidence as those captured under ideal circumstances.
Despite these challenges, there are ample opportunities for automated systems capable of executing the American football video footage analysis process. The potential for such systems becomes even more evident through the diverse studies referenced in the following sections, as well as the work presented in this study.

1.1. Related Work

The variety of work performed in the realm of automated sports analytics is extensive, including everything from machine learning for cricket match outcome prediction [2] to deep learning for analysis of swimmers’ data collected from an Inertial Measurement Unit (IMU) [3]. In this paper, we are most concerned with the automated annotation of visual sports footage, specifically the detection of multi-person events in said footage and the application of these techniques to American football.
A considerable amount of work has been done to automate the annotation of sports videos. For example, in [4,5], researchers used Support Vector Machine (SVM) classifiers alongside traditional computer vision techniques to successfully classify strokes in tennis game videos. In working towards annotation of multi-person videos, Zhang et al. achieved the state-of-the-art in tracking and identifying multiple players across multi-camera basketball videos [6]. Their approach synthesized region-based convolutional neural networks and pose estimation for identification, object segmentation for 3D localization, and a k-shortest paths algorithm for tracking.
The detection of specific action events in multi-person sports videos has likewise received considerable attention in recent years. Given the temporal nature of player motion in sports videos, recurrent neural network architectures such as LSTMs have seen widespread adoption for this purpose. In [7], Ramanathan et al. designed an attention and LSTM-based network for event detection and classification in multi-person videos, effectively identifying key individuals responsible for said events. Though adaptable to any multi-person video, they applied their models to basketball videos, evaluating them on their ability to recognize the shooter. More recently, Khobdeh et al. applied a deep fuzzy LSTM to the recognition of basketball actions such as dribble, pass, block, and shoot [8]. A similar study was completed in [9], wherein Tora et al. classified ice hockey puck possession events via AlexNet for feature extraction and an LSTM for inference.
As far as applications of multi-person video analysis and event detection to American football, perhaps the work most similar to what is presented in this paper is [10,11]. With the desired event to detect being the players assuming formation, Atmosukarto et al. calculated the frame-to-frame mean squared errors for all pixels in the video and then used an SVM for the detection of the frame with the lowest-magnitude motion. Most other closely related studies either do not focus on the task of multi-person event detection or are not specific to American football. For instance, ref. [12] presents the detection of football receivers’ catches in audiovisual data using a Convolutional Neural Network (CNN) and LSTM, while [13] details a Variational Autoencoder (VAE)-based method for event detection in soccer videos.
Considering all the aforementioned works, no significant study has yet demonstrated the effectiveness of an LSTM for multi-person event detection in American football videos.

1.2. Our Previous Work

Our previous works provide context for much of what is described in this paper. In [14], we created a system that locates football players in an image, identifies their respective positions (roles), and then recognizes the formation of the offensive team. While the work reports generally favorable results and serves as a strong proof of concept, it also exhibits several opportunities for improvement.
The data in [14] are entirely sourced from Madden 2020, a football video game that provided easy access to clean, consistent images of pre-labeled formations. This allowed for dependable lighting, weather conditions, and image quality throughout the dataset. The resulting collection of images was easy to work with but far from representative of real-world footage.
The methods outlined in [14] also present limitations. Notably, the detected player locations were not transformed prior to analysis by the formation recognition module. Since the images were captured at a roughly 30-degree angle (from the end of the field, behind the offense), the pixel-space locations may not accurately reflect the players’ true relative distributions. Consequently, the formation recognition module was trained not with bird’s-eye view data that accurately represent player formations, but with skewed renderings of them.
Furthermore, our Madden 2020 dataset [14] was manually collected, with each of the 1000 images captured by hand when the players were in formation. While this labor-intensive process was feasible for one-time data collection, automating the detection of moments wherein players have assumed formation is essential for adapting the system to recognize formations given full-length game recordings. This task is the focus of this work.
In another work [15], we addressed the player location transformation issue [14] by projecting the points onto a virtual football field. This was accomplished using traditional line detection techniques with a basic convolutional neural network for number recognition, with the numbers serving as markers for precise player placement on the field. The outcome was acceptable but lacked robustness, as noted in the paper’s conclusion. For instance, the number recognition dataset consisted of only 500 images of single digits. Expanding this dataset to include double-digit numbers, like those on football field yard lines, could improve the resulting virtual field projections.

1.3. Scope of Work

The work presented in this paper is part of an ongoing effort to automate the annotation of football game footage; that is, the identification and analysis of offensive formations. The fundamental proof of concept was evaluated end-to-end and improved upon in our previous works [14,15]. In this work, we present refined versions of key system components, including player recognition and player location transformation. A critical attribute of these refinements is their application to real-world data. Additionally, we tackle one of the most time-consuming aspects of sports footage annotation—the need to scrub through videos to find players in formation—by proposing a method for automated snap detection. As the snap occurs just before players break formation, its detection serves as an effective indicator of when players are in formation.
Here, we present a system that takes a football game video clip and outputs a frame where the players are in formation (see Figure 1). Central to the work is our innovative approach to automated snap detection, but we also address the limitations of our previous works as identified in Section 1.2. Specifically, we collect three sets of real-world data, each with over 1200 samples. The integration of this data, along with updated tools not available during the completion of [14], results in a significantly improved system. Our upgraded approaches to player recognition, number recognition, and the resulting player location transformation show elevated robustness, especially when applied to real-world data (as seen in Figure 2).
The contributions of our work include the following: (1) a new approach to automated multi-person sports event detection, namely the snap in American football, and (2) a set of revisions to our pre-existing techniques for player recognition, number recognition, and player location transformation. The real-world results of these tasks not only stand for themselves but also provide an unparalleled foundation for future development of the remaining portions of our system’s pipeline: differentiation of offense and defense, player position identification, and formation identification.
This paper is organized as follows: Section 2 describes our methods for each relevant stage of the system pipeline; Section 3 details the results of our work; Section 4 includes a discussion of our system as a whole, with an emphasis on improvements still to be made; and Section 5 concludes the paper.

2. Methods

The input to our system is a football game video clip. For the detection of an event such as the snap, players must be tracked throughout the video, so we begin by recognizing both players and numbers using two YOLOv8x detection models [16]. Alongside lines that we detect using traditional techniques, the numbers are used to place the player locations onto a virtual field in bird’s-eye view, allowing for their consistent in-field localization for the duration of the video, regardless of the camera’s panning. This is essential for successful formation detection as well as future formation identification. Lastly, we autonomously detect the moment wherein players assume formation using an LSTM network. We can postprocess the network outputs to retrieve both the coordinates and images of the players in formation. Figure 3 provides a visual overview of this sequence.

2.1. Player Recognition

Any automated analysis of player formations requires prior knowledge of where the players are located. To recognize the players and extract their locations, we utilized YOLOv8x, the latest and largest stable implementation of the YOLO (You Only Look Once) object detection framework at the time of writing [16,17]. Compared to YOLOv3 (the version used in our previous work [14]), this architecture achieves a more than 63% improvement in mean average precision (mAP) when evaluated on the COCO dataset [18]. Such improvements made it the obvious choice for use in our upgraded pipeline.
Our custom dataset for training the player recognition model consists of 1402 images captured from footage of 9 football games. We meticulously labeled more than 30,000 player instances using the makesense.ai (https://www.makesense.ai/ accessed on 12 September 2024) object labeling tool [19], prioritizing accuracy and precision throughout the process. The goal of our intense effort in the labeling process was to ensure that player locations extracted from the model’s predictions would be as accurate as possible (see Figure 4 for an example of our exactness in labeling). We additionally emphasized labeling each player, even when they were obscured by others. This allowed us to ensure our future model would have a higher recall across videos, as football footage captured from moderate heights is likely to have a high frequency of obscured players.
We trained our model using the Ultralytics Python package [16]. With the exception of batch size and number of epochs—determined automatically by GPU resource limits and early stopping—all hyperparameters were set to default. The data were randomly split 80/20 into training and validation sets, with 1119 images in the training set and 283 images in the validation set.
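For illustration, a minimal training sketch using the Ultralytics Python API is shown below. The dataset YAML path and the epoch/patience values are placeholders rather than our exact settings; as noted above, batch size and epoch count were determined automatically by GPU limits and early stopping.

```python
# Hypothetical training sketch with the Ultralytics package; paths and limits are placeholders.
from ultralytics import YOLO

model = YOLO("yolov8x.pt")            # pretrained YOLOv8x weights as a starting point
model.train(
    data="football_players.yaml",     # hypothetical YAML listing train/val image dirs and the "player" class
    epochs=1000,                      # upper bound; early stopping halts training sooner
    patience=50,                      # stop once validation mAP stops improving
    batch=-1,                         # let Ultralytics choose a batch size that fits the GPU
)
metrics = model.val()                 # precision/recall/mAP on the validation split
```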

2.2. Transformation of Player Locations

When extracted from the YOLO model’s predictions, player locations are not localized within the football field but are in the domain of an image’s pixel space. Therefore, the next vital step in our pipeline is the transformation of player locations to a virtual field in bird’s-eye view. This ensures that, no matter where the camera pans during a video, we can know where on the field players are located. Additionally, player formations can be more accurately assessed when viewed from above, as opposed to the sides of the field.
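Although the estimation details differ from the line- and number-based procedure used in our pipeline (see [15]), the following minimal sketch illustrates how pixel-space player locations can be mapped onto a bird’s-eye virtual field once a planar homography is available. All point correspondences and coordinates shown are hypothetical.

```python
# Minimal illustration (not our exact procedure): project player pixel coordinates onto a
# bird's-eye virtual field via a planar homography, assuming at least four
# pixel <-> field correspondences (e.g., yard-line/number landmarks) are already known.
import numpy as np
import cv2

# Hypothetical correspondences: (x, y) in image pixels -> (x, y) in virtual-field units.
pixel_pts = np.float32([[412, 655], [903, 640], [388, 820], [955, 801]])
field_pts = np.float32([[30.0, 10.0], [40.0, 10.0], [30.0, 25.0], [40.0, 25.0]])

H, _ = cv2.findHomography(pixel_pts, field_pts)

# Player locations taken from the bottom-center of each detected bounding box (pixels).
players_px = np.float32([[500, 700], [620, 710], [705, 695]]).reshape(-1, 1, 2)
players_field = cv2.perspectiveTransform(players_px, H).reshape(-1, 2)
print(players_field)   # bird's-eye coordinates on the virtual field
```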
Our revisions to the player location transformation stage are found in our approaches to the underlying number recognition step. Just as in the player recognition stage (see Section 2.1), these revisions include a new custom dataset and the usage of a YOLOv8x model.
Our dataset for training the number recognition model consists of 2154 images captured from the same 9 football games used for the player recognition dataset. Our labeling process was identical in terms of tools used [19] and effort expended, with the only notable difference being the presence of more than one class in the data.
The model was trained in the same manner as in the previous stage, with the Ultralytics package [16] and mostly default hyperparameters. The data were randomly split 80/20, with 1794 images used for training and 360 images used for validation.

2.3. Snap Event Detection

Pinpointing the moment when formations are assumed is vital to the overarching goal of automating offensive formation recognition and analysis, as manually scrubbing through hours of video to locate a seconds-wide moment in time is troublesome and time-consuming. In our preliminary research, we found that rule-based algorithms incapable of generalization would be unfit for handling the erratic nature of our frame-by-frame player location data. We consequently conceived the proposed artificial intelligence-based approach, namely an LSTM for event detection.
The event of interest is the snap: the passing of the ball that occurs at the beginning of a play from scrimmage. However, at this stage in the pipeline, we work not with video data where the ball can be detected, but with bird’s-eye view player locations mapped onto a virtual field. Therefore, ‘snap’ refers not to the literal passing of the ball, but to the subsequent burst of activity discernible in the player location data. Given that detectable eruptions in player movement follow periods of stillness, it is clear that identifying these eruptions is a plausible approach to the detection of still players in formation.
Central to our detection method is the LSTM, a type of recurrent neural network capable of capturing relevant dependencies over long periods of time [20]. Learning to selectively maintain or forget information, LSTMs are well suited for analyzing sequential data such as the kind available at this stage in our pipeline. The LSTM is described in greater detail in Section 2.3.2.

2.3.1. Event Detection Dataset

The dataset used to train our event detection model originated from the same nine football games used to train the player and number recognition models. Having player locations extracted, transformed to bird’s-eye view, and projected to a virtual field, our event detection dataset contains the localized, overhead player locations for every frame of almost every video clip from the nine games, for a total of 1281 sequences. The sequence lengths range from 312 to 2549 frames. Video clips with no manually observable snap event were discarded.
Preparation of this dataset involved manually splitting each sequence into three segments (classes): pre-snap, snap, or post-snap. Intuitively, frames in the snap class are those where the offensive team has assumed formation and is just about to snap. As video clips contain only one play each, all of the frames before and after the snap segment are classified as pre-snap and post-snap, respectively. Figure 5 shows an overview of our labeling methodology, wherein each frame is individually grouped into one of three contiguous segments. By allowing the model to infer the transition between classes, this method of labeling encourages detection not just of the snap itself (i.e., the movement following the passing of the ball) but also of the preceding assumption of formation, which we also classify as snap.
After labeling, we augmented our dataset from 1281 to 2745 sample sequences. This augmentation allowed us to achieve three objectives: (1) create a test set using original data evenly sampled from the nine games, (2) triple the size of the remaining training data, and (3) maintain a random 80/20 split between training and test sets. The data were augmented at the sequence level through a combination of shifting and scaling. Specifically, player locations in the entire sequence were shifted in both the x and y directions by up to ±5% of the virtual field’s dimensions. Following this, they were randomly scaled by up to ±10%.
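A minimal sketch of this sequence-level augmentation is shown below. The array layout, field dimensions, and the choice to scale about the sequence centroid are assumptions made for illustration.

```python
# Sketch of the sequence-level augmentation described above (shift up to ±5% of the field
# dimensions, then scale by up to ±10%); array layout and field size are assumptions.
import numpy as np

FIELD_W, FIELD_H = 120.0, 53.3   # assumed virtual-field dimensions in yards

def augment_sequence(seq: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """seq has shape (num_frames, num_players, 2) holding (x, y) field coordinates."""
    shift = rng.uniform(-0.05, 0.05, size=2) * np.array([FIELD_W, FIELD_H])
    scale = rng.uniform(0.90, 1.10)
    center = seq.reshape(-1, 2).mean(axis=0)            # scale about the sequence centroid (assumed)
    return (seq - center) * scale + center + shift      # same transform applied to every frame

rng = np.random.default_rng(0)
example = np.zeros((300, 22, 2)) + np.array([60.0, 26.65])   # dummy sequence of 300 frames
augmented = augment_sequence(example, rng)
```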
We finalized our data preprocessing by normalizing all points relative to the dimensions of the virtual field, scaling values to the range of [0, 1]. Because the player locations had been projected onto the virtual field, this normalization was straightforward and also preserved the spatial relationships of their on-field locations. This technique enhanced the model’s learning efficiency and helped maintain the stability of gradients.

2.3.2. LSTM Network

While the LSTM is not unique in its ability to learn patterns from sequential data, we found it to be more suitable for our task compared to other architectures. A transformer encoder required more data than were available, and a gated recurrent unit (GRU) did not perform as well on our data. What separates the LSTM from many others is its capacity for maintaining long-term dependencies in its cell state. Utilizing three unique gates—each a layer with weights, biases, and an activation function—it learns what to forget, retain, and output (see Figure 6).
In this work, we adopted a bidirectional LSTM for our event detection task. This variant possesses all the benefits of the base architecture but processes sequences in both forward and backward directions, equipping the model with contextual information from both before and after each time step [22].
Our implementation had four layers and a hidden state size of 128, meaning x_t (an item in sequence x at time step t) is processed by eight separate LSTM cells (four in each direction), connected sequentially via cell and hidden states that can each contain 128 features. To combat overfitting, the first three layers were also connected to dropout layers, with a dropout probability of 0.3. Figure 7 shows the complete block diagram of our network.
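The following PyTorch sketch mirrors the dimensions stated above (four bidirectional layers, hidden size 128, inter-layer dropout of 0.3, and a 256-to-3 linear head); the per-frame input feature size is an assumption, since it depends on how the player coordinates are flattened per frame.

```python
# Sketch of the network described above; INPUT_SIZE (features per frame) is an assumption.
import torch
import torch.nn as nn

INPUT_SIZE = 44           # e.g., 22 players x (x, y) per frame (assumed)
NUM_CLASSES = 3           # pre-snap, snap, post-snap

class SnapDetector(nn.Module):
    def __init__(self):
        super().__init__()
        self.lstm = nn.LSTM(
            input_size=INPUT_SIZE,
            hidden_size=128,
            num_layers=4,
            dropout=0.3,           # applied after the first three layers
            bidirectional=True,
            batch_first=True,
        )
        self.head = nn.Linear(2 * 128, NUM_CLASSES)   # concatenated hidden state 256 -> 3

    def forward(self, x):          # x: (batch, seq_len, INPUT_SIZE)
        out, _ = self.lstm(x)      # out: (batch, seq_len, 256)
        return self.head(out)      # per-frame class logits

logits = SnapDetector()(torch.randn(2, 500, INPUT_SIZE))   # shape (2, 500, 3)
```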
This architecture was used in tandem with the cross-entropy loss function, to which we added a label smoothing parameter of 0.025 that introduces a small amount of noise into the ground truth labels. This smoothing makes the model more adaptable in the presence of ambiguous transitions between class labels [23] and helped address the challenge of determining when events begin and end (also discussed in [7]).
Lastly, the final output of our LSTM is mapped to one of the three classes (pre-snap, snap, post-snap) by a single linear layer and an argmax operation. Accommodating the doubled size of the bidirectional LSTM’s final concatenated hidden state h_t, the linear layer is of input size 256 and output size 3. The network was trained with a batch size of 16 on an NVIDIA Tesla P100 (NVIDIA, Santa Clara, CA, USA). We used the Adam optimizer with a learning rate of 1 × 10⁻³ and weight decay of 1 × 10⁻⁵.
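Continuing the previous sketch, the training configuration below matches the values stated above (cross entropy with label smoothing 0.025, Adam with a learning rate of 1 × 10⁻³ and weight decay of 1 × 10⁻⁵, batch size 16); the data loader is a placeholder.

```python
# Training configuration matching the stated values; batches of 16 (sequences, per-frame labels)
# are assumed to come from a placeholder data loader.
import torch
import torch.nn as nn

model = SnapDetector()                                    # defined in the previous sketch
criterion = nn.CrossEntropyLoss(label_smoothing=0.025)    # smoothed targets soften class boundaries
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-5)

def train_step(batch_x: torch.Tensor, batch_y: torch.Tensor) -> float:
    """batch_x: (16, seq_len, INPUT_SIZE); batch_y: (16, seq_len) integer labels in {0, 1, 2}."""
    optimizer.zero_grad()
    logits = model(batch_x)                               # (16, seq_len, 3)
    loss = criterion(logits.reshape(-1, 3), batch_y.reshape(-1))
    loss.backward()
    optimizer.step()
    return loss.item()

# At inference, per-frame class predictions come from an argmax over the logits:
# preds = model(sequences).argmax(dim=-1)   # (batch, seq_len), values in {0, 1, 2}
```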

3. Results

3.1. Player Recognition

After training for 491 epochs, our player recognition model achieved an F1 score of 96% when evaluated on the test set. Performing consistently well across confidence values of up to 0.9, this is a dramatic improvement over the comparable stage in our previous work, which reached only 90.3% accuracy on the dataset manually acquired from the Madden 2020 football video game [14]. Figure 8 shows the F1–confidence curve for our model.
It is worth noting that our new model not only performed at a remarkably higher level than our previous work, but it also did so on real-world data. Compared to the data used in [14], this set is more representative of the world and thus makes the system trained on it more capable of generalization. Refer to Figure 2 for a visual comparison of the two datasets.
Furthermore, the improvement in our player recognition lies in both the number of players recognized and the quality of the predictions. Because our model’s predicted bounding boxes are as tight and centered as possible (even around occluded players), the extracted player locations are highly accurate, laying a strong foundation for the improvement of the subsequent transformation stage.

3.2. Transformation of Player Locations

Much like the player recognition stage, our updated player location transformation stage displays results that greatly surpass those of our previous work. The foundation of these results is our refined capacity for player recognition (see Section 3.1): as the underlying points became more accurate, so did the resulting transformations.
Additional enhancement came from our new approach to number recognition. Evaluating our model on the test data, we attained an F1 score of 97% after training for 354 epochs. Figure 9 shows the consistency of this result for confidence values of up to 0.9. We again emphasize the significance of this improvement (over our previous number recognition accuracy of 96% [15]), achieved on a dataset that is (1) larger, (2) collected from the real world, and (3) composed of whole yard-line numbers rather than single digits.
Figure 10 demonstrates the high quality of player transformations resulting from our updated work in the prerequisite steps. The players’ locations are deduced, transformed, and projected in a robust manner that provides placement on the virtual field that is far more accurate than our previous work [15]. Furthermore, the capacity for generalization and reliability of our player transformation stage has increased; the high degree of quality presented in Figure 10 is pervasive throughout the entire set of results.

3.3. Snap Event Detection

We designed our event detection model to make predictions on the data in the same manner as they were labeled. Specifically, each input sample is a sequence (i.e., one video clip’s worth) of player location frames, with each frame in the sequence receiving a predicted label of pre-snap, snap, or post-snap. As in the preliminary labeling (see Figure 5), this effectively splits the sequence into three segments. Assuming sequence x of length c + d + e , we can represent a sequence’s labels y as shown in Equation (1).
y = \left( \underbrace{0, 0, \ldots, 0}_{c\ \text{times}},\ \underbrace{1, 1, \ldots, 1}_{d\ \text{times}},\ \underbrace{2, 2, \ldots, 2}_{e\ \text{times}} \right)    (1)
where c, d, e ≥ 1 and 0, 1, 2 represent pre-snap, snap, and post-snap, respectively.
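For concreteness, such a label sequence can be constructed directly from the three segment lengths; the lengths used below are arbitrary.

```python
# Building a ground-truth label sequence per Equation (1): c pre-snap frames (0),
# d snap frames (1), and e post-snap frames (2). The segment lengths are arbitrary examples.
import numpy as np

c, d, e = 180, 45, 220
y = np.concatenate([np.zeros(c, dtype=int), np.ones(d, dtype=int), np.full(e, 2, dtype=int)])
```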
To assess these predicted sequences, we adopted the Levenshtein similarity index (LSI), a standardized metric for sequence comparison based on the Levenshtein distance [24]. Also known as edit distance, the Levenshtein distance calculates the minimum number of single-element edits (deletions, insertions, and substitutions) required to make two sequences identical [25], with a pair of matching sequences having a Levenshtein distance of zero. For an analytical explanation, see Equation (2).
\mathrm{lev}_{a,b}(i, j) =
\begin{cases}
\max(i, j) & \text{if } \min(i, j) = 0,\\[4pt]
\min
\begin{cases}
\mathrm{lev}_{a,b}(i-1, j) + 1\\
\mathrm{lev}_{a,b}(i, j-1) + 1\\
\mathrm{lev}_{a,b}(i-1, j-1) + \mathbf{1}_{(a_i \neq b_j)}
\end{cases}
& \text{otherwise.}
\end{cases}    (2)
where:
a, b are sequences
i is the index in sequence a
j is the index in sequence b
The LSI (or Levenshtein distance similarity [26]) is calculated by normalizing the Levenshtein distance by the length of the longer sequence and then taking the complement (see Equation (3)). Contrary to the Levenshtein distance, an LSI of zero indicates no similarity, while a pair of matching sequences has an LSI of one.
\mathrm{LSI}(a, b) = 1 - \frac{\mathrm{lev}_{a,b}(|a|, |b|)}{\max\{|a|, |b|\}}    (3)
where:
|a| is the length of a
|b| is the length of b
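The metric can be computed with a standard dynamic program; a minimal, library-agnostic sketch follows.

```python
# Minimal sketch of the LSI computation: a standard Levenshtein dynamic program
# followed by the normalization and complement of Equation (3).
def levenshtein(a, b):
    m, n = len(a), len(b)
    prev = list(range(n + 1))                  # distances from the empty prefix of a
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            curr[j] = min(prev[j] + 1,         # deletion
                          curr[j - 1] + 1,     # insertion
                          prev[j - 1] + cost)  # substitution (or match)
        prev = curr
    return prev[n]

def lsi(a, b):
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))

# e.g., a predicted label sequence vs. its ground truth
print(lsi([0, 0, 1, 1, 2, 2], [0, 0, 0, 1, 2, 2]))   # 1 - 1/6 ≈ 0.833
```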
After training for 100 epochs, our event detection model achieved an exceptional average LSI (see Equation (3)) of 0.9445. In other words, the predicted sequences and their ground truth counterparts on average had normalized Levenshtein distances (differences) of less than 0.06. This remarkable result illustrates the robustness of our system and its potential for saving thousands of hours for coaches of American football.
As discussed in Section 2.3.2, the decision to use bidirectional LSTM for our event detection task was not made without prior experimentation. Other suitable architectures for learning sequential data include the gated recurrent unit (GRU) and transformer encoder, both of which we trained and tested on the same data. The GRU was implemented bidirectionally with the same network dimensions and hyperparameters as the LSTM. The transformer encoder implementation was relatively simple: two layers, four heads, and a feedforward/hidden dimension size of 32. Table 1 contains the resulting LSIs of these other architectures.
Compared to the GRU, the LSTM exhibited a modest but meaningful improvement in performance, likely due to its higher parameter count. In contrast, the transformer encoder showed signs of overfitting, while the LSTM’s simpler architecture allowed for better generalization.
As satisfactory as our results are, we were unable to perform one-to-one comparisons with other works due to fundamental differences in methodologies, metrics, and datasets. As discussed in Section 1.1, refs. [10,11] detail the studies that are most similar to this one, involving a method for detecting the formation frame. While we perform this task using a series of player positions extracted from videos, their approach relies on a series of mean squared error pixel differences between video frames. The lack of publicly available code and datasets renders it unfeasible to directly apply their methods to our dataset or our methods to theirs. We acknowledge the similar utility of both our method (0.9445 LSI) and theirs (94.53% accuracy), but we cannot conduct a thorough comparison between them.
In addition to quantitatively validating our event detection sequence predictions, we are also interested in visualizing our detection of the players’ assuming of formation, as this goal is what motivated the task of snap event detection. Each frame in a sequence (i.e., one video clip’s worth) of player location data is classified as pre-snap, snap, or post-snap, effectively segmenting the sequence into three contiguous blocks (see Figure 5 and Equation (1)). To visualize our event detection model’s usefulness in detecting offensive players in formation, we can identify the middle frame of a predicted snap segment (which contains the stillness of players before the snap) and extract the corresponding frame from the original video clip. An example of one of these frames is shown in Figure 11, where players were successfully detected by the model to be in formation and about to snap.
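A minimal sketch of this postprocessing step is shown below; the per-frame predictions and the clip path are placeholders.

```python
# Sketch of the postprocessing described above: take the middle frame of the predicted
# snap segment and pull the matching frame from the clip. The clip path is a placeholder.
import numpy as np
import cv2

def formation_frame_index(preds: np.ndarray) -> int:
    """Return the middle index of the contiguous block predicted as snap (class 1)."""
    snap_idx = np.flatnonzero(preds == 1)
    return int(snap_idx[len(snap_idx) // 2])

preds = np.array([0] * 180 + [1] * 45 + [2] * 220)   # example per-frame predictions
idx = formation_frame_index(preds)

cap = cv2.VideoCapture("clip.mp4")                   # placeholder path
cap.set(cv2.CAP_PROP_POS_FRAMES, idx)
ok, frame = cap.read()
if ok:
    cv2.imwrite("formation_frame.png", frame)        # image of the players in formation
cap.release()
```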

4. Future Work

For both our system in particular and the field of multi-person event detection in sports video more broadly, the opportunities for further development are plentiful. The final goal of our system is to input a video of game footage and automatically generate a report that provides both granular and high-level analyses of the game. Currently, we can input a video clip from a game and determine when the offensive players have assumed formation. Several steps remain before the system is complete. These include, but are not limited to, the extraction of offensive players’ locations from the snap window, the identification of personnel (i.e., player position), and the identification of offensive formations.
An MLP or other network architecture could be quite useful in extracting offensive players’ locations. Assuming detection of the line of scrimmage (separating line), each team could be separately classified into offense or defense. Once this is completed, the position (role) of each offensive player must be identified; that information is then used for formation identification. We expect that both of these tasks could also be completed with simple MLP networks.
Once those steps are completed, this pipeline must be applied to not only short video clips but also video footage of entire games. That could require automatically slicing videos in a manner similar to the event detection stage presented in this work. Doing this will allow us to input an entire game video and create a log of every formation assumed during the game, effectively paving the way for the future task of generating a game summary report.
One of the many nontrivial challenges faced in this work is that of player occlusion resulting from non-ideal camera angles. Repeating the various stages of our pipeline using footage taken from a higher angle would thus be a worthwhile endeavor in obtaining even better results, as footage taken from higher camera angles is likely to have fewer occluded players and other obstructions. An alternative solution to this challenge is to use the same low-angle, real-world footage but reinsert players that go undetected due to occlusion. This could be accomplished manually or even autonomously via a wide array of algorithms. Such an improvement could increase the accuracy of our future stages.
Additionally, this work has massive potential for expansion to other sports, such as soccer, where methods for multi-person event detection are already established. Adapting our approach to these sports and analyzing performance differences would be a worthwhile endeavor toward the development of this technology and the automated sports analytics field as a whole.

5. Conclusions

In this work, we introduced a unique LSTM-based method for detecting the snap in American football, attaining an LSI of 0.9445. This high similarity score achieved on real-world data validates the effectiveness of our approach and the vast improvements made to our preprocessing steps. The implications of our approach for the objective of detecting players in formation were also visualized. Moreover, our results indicate the capacity of this technology for various applications, particularly other sports which likewise emphasize the analysis of multi-person motion. This work additionally lays a resilient foundation for the ongoing development of our overall system, which aims to generate statistical reports from full-length football game videos autonomously. The potential of such a system is significant, as it could substantially impact the work of thousands of coaches by saving them countless hours and thereby increasing player effectiveness.

Author Contributions

Conceptualization, E.P. and D.-J.L.; Methodology, B.O. and D.-J.L.; Software, B.O.; Validation, E.P.; Formal analysis, E.P.; Investigation, B.O.; Data curation, E.P.; Writing—original draft, B.O.; Writing—review & editing, E.P. and D.-J.L.; Visualization, B.O.; Supervision, D.-J.L.; Project administration, D.-J.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The original contributions presented in the study are included in the article. Further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. NFL. NFL Officials: Preparing for Success. 2024. Available online: https://operations.nfl.com/officiating/nfl-officials-preparing-for-success/ (accessed on 12 September 2024).
  2. Kapadia, K.; Abdel-Jaber, H.; Thabtah, F.; Hadi, W. Sport analytics for cricket game results using machine learning: An experimental study. Appl. Comput. Inform. 2022, 18, 256–266. [Google Scholar] [CrossRef]
  3. Delhaye, E.; Bouvet, A.; Nicolas, G.; Vilas-Boas, J.P.; Bideau, B.; Bideau, N. Automatic Swimming Activity Recognition and Lap Time Assessment Based on a Single IMU: A Deep Learning Approach. Sensors 2022, 22, 5786. [Google Scholar] [CrossRef] [PubMed]
  4. Zhu, G.; Xu, C.; Gao, W.; Huang, Q. Action recognition in broadcast tennis video using optical flow and support vector machine. In Proceedings of the Computer Vision in Human-Computer Interaction: ECCV 2006 Workshop on HCI, Graz, Austria, 13 May 2006; Proceedings 9. Springer: Berlin/Heidelberg, Germany, 2006; pp. 89–98. [Google Scholar]
  5. Shah, H.; Chokalingam, P.; Paluri, B.; Pradeep, N.; Raman, B. Automated stroke classification in tennis. In Proceedings of the Image Analysis and Recognition: 4th International Conference, ICIAR 2007, Montreal, Canada, 22–24 August 2007; Proceedings 4. Springer: Berlin/Heidelberg, Germany, 2007; pp. 1128–1137. [Google Scholar]
  6. Zhang, R.; Wu, L.; Yang, Y.; Wu, W.; Chen, Y.; Xu, M. Multi-camera multi-player tracking with deep player identification in sports video. Pattern Recognit. 2020, 102, 107260. [Google Scholar] [CrossRef]
  7. Ramanathan, V.; Huang, J.; Abu-El-Haija, S.; Gorban, A.; Murphy, K.; Li, F.-F. Detecting Events and Key Actors in Multi-Person Videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016. [Google Scholar]
  8. Khobdeh, S.B.; Yamaghani, M.R.; Sareshkeh, S.K. Basketball action recognition based on the combination of YOLO and a deep fuzzy LSTM network. J. Supercomput. 2024, 80, 3528–3553. [Google Scholar] [CrossRef]
  9. Tora, M.R.; Chen, J.; Little, J.J. Classification of puck possession events in ice hockey. In Proceedings of the 2017 IEEE conference on computer vision and pattern recognition workshops (CVPRW), Honolulu, HI, USA, 21–26 July 2017; IEEE: New York, NY, USA, 2017; pp. 147–154. [Google Scholar]
  10. Atmosukarto, I.; Ghanem, B.; Ahuja, S.; Muthuswamy, K.; Ahuja, N. Automatic recognition of offensive team formation in american football plays. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Portland, OR, USA, 23–28 June 2013; pp. 991–998. [Google Scholar]
  11. Atmosukarto, I.; Ghanem, B.; Saadalla, M.; Ahuja, N. Recognizing team formation in American football. In Computer Vision in Sports; Springer: Cham, Switzerland, 2014; pp. 271–291. [Google Scholar]
  12. Hollaus, B.; Reiter, B.; Volmer, J.C. Catch Recognition in Automated American Football Training Using Machine Learning. Sensors 2023, 23, 840. [Google Scholar] [CrossRef] [PubMed]
  13. Karimi, A.; Toosi, R.; Akhaee, M.A. Soccer event detection using deep learning. arXiv 2021, arXiv:2102.04331. [Google Scholar]
  14. Newman, J.; Sumsion, A.; Torrie, S.; Lee, D.J. Automated Pre-Play Analysis of American Football Formations Using Deep Learning. Electronics 2023, 12, 726. [Google Scholar] [CrossRef]
  15. Wright, K.; Torrie, S.; Orr, B.; Lee, D.J. Video Preprocessing for American Football Formation Recognition. In Proceedings of the 2024 Intermountain Engineering, Technology and Computing (IETC), Logan, UT, USA, 13–14 May 2024; pp. 102–107. [Google Scholar] [CrossRef]
  16. Jocher, G.; Chaurasia, A.; Qiu, J. Ultralytics YOLOv8. 2023. Available online: https://github.com/ultralytics/ultralytics (accessed on 12 September 2024).
  17. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. arXiv 2015, arXiv:1506.02640. [Google Scholar]
  18. Redmon, J.; Farhadi, A. YOLOv3: An Incremental Improvement. arXiv 2018, arXiv:1804.02767. [Google Scholar]
  19. Skalski, P. Make Sense. 2019. Available online: https://github.com/SkalskiP/make-sense/ (accessed on 12 September 2024).
  20. Hochreiter, S.; Schmidhuber, J. Long Short-Term Memory. Neural Comput. 1997, 9, 1735–1780. [Google Scholar] [CrossRef] [PubMed]
  21. Chevalier, G. LSTM Cell. 2018. Available online: https://commons.wikimedia.org/wiki/File:LSTM_Cell.svg (accessed on 12 September 2024).
  22. Graves, A.; Schmidhuber, J. Framewise phoneme classification with bidirectional LSTM and other neural network architectures. Neural Netw. 2005, 18, 602–610. [Google Scholar] [CrossRef] [PubMed]
  23. Szegedy, C.; Vanhoucke, V.; Ioffe, S.; Shlens, J.; Wojna, Z. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 2818–2826. [Google Scholar]
  24. Helweg, D.; Cato, D.; Jenkins, P.; Garrigue, C.; McCauley, R. Geographic Variation in South Pacific Humpback Whale Songs. Behaviour 1998, 135, 1–27. [Google Scholar] [CrossRef]
  25. Levenshtein, V.I. Binary codes capable of correcting deletions, insertions, and reversals. Sov. Phys. Dokl. 1966, 10, 707–710. [Google Scholar]
  26. Yang, L.; Fu, S.; Luo, Y.; Wang, Y.; Zhao, W. A Clustering Method of Encrypted Video Traffic Based on Levenshtein Distance. In Proceedings of the 2021 17th International Conference on Mobility, Sensing and Networking (MSN), Exeter, UK, 13–15 December 2021; pp. 1–8. [Google Scholar] [CrossRef]
Figure 1. An abstract depiction of the work described in this paper. Part of a larger system, our pipeline takes a football game video clip as input and outputs the point in the clip where the offensive players are in formation.
Figure 2. An example of the stark differences between our previous dataset [14] (left) and one of our new datasets (right). The previous dataset is a collection of images captured from the Madden 2020 video game. The new datasets were sourced from real-world game footage. The real-world data contain variations in weather, lighting, perspective, and image quality.
Figure 3. An overview of our system pipeline. The input is a video clip of a football game. The locations of players and numbers are recognized using YOLOv8x models, and field lines are detected using traditional computer vision techniques. The lines and number locations are used to transform the player locations into bird’s-eye view and project them onto a virtual football field. The localized player locations are then input to our event detection model (LSTM), which detects when the players are in formation.
Figure 4. A snapshot of the manual labeling process for our player recognition dataset. To ensure accurate player locations, we hand-labeled each image with the utmost attention to detail, focusing on tightly positioning bounding boxes around each player. This included players who were not 100% visible to the camera—a common occurrence in our dataset and other similar datasets. We completed our manual labeling using the makesense.ai object labeling tool [19].
Figure 5. Examples of labels in our event detection dataset. The dataset consists of preprocessed sequences of player locations extracted from video clips. By visualizing five frames from a single sequence, we can clearly understand the labeling methodology. Player locations were used to assign a label to each frame, dividing the sequence into three distinct segments: pre-snap for frames before formation, snap for frames during formation and the snap, and post-snap for frames after formation breaks.
Figure 6. A diagram of a single LSTM cell, based on [21]. Taking in a current input x_t and previous cell and hidden states c_{t-1} and h_{t-1}, the cell makes use of four layers of weights (comprising three gates) to learn what to forget, retain, and output at each time step. New cell and hidden states c_t and h_t are then passed to the next cell or used in the next time step, as appropriate.
Figure 7. The block diagram for our event detection network. Consisting of four bidirectional LSTM layers, each sequence x is serially processed in both directions simultaneously by two sets of four LSTM cells (see Figure 6): by the forward set as x_0, x_1, …, x_{T-1} and by the backward set as x_{T-1}, x_{T-2}, …, x_0. Each of the eight cells has a unique set of weights and states. The cells are connected via dropout layers of probability 0.3 and hidden and cell states of size 128. For each time step t, the final hidden states of both sets of LSTM cells are concatenated to form the final output h_t, now of size 256. This is input into a single linear layer, which outputs the probabilities for each of our three classes.
Figure 8. The F1–confidence curve for our player recognition model. This plot shows F1 scores for the whole range of confidence values, where F1 is the harmonic mean of precision and recall, and confidence is a measure of the model’s certainty about its predictions. We achieved an F1 score of 96% at confidence values of up to 0.9, demonstrating notable improvements over the 90.3% accuracy of this stage in our previous work [14].
Figure 9. The F1–confidence curve for our number recognition model (see Figure 8 for an explanation of F1 and confidence). Each thin line represents the F1 score for an individual output class, while the thick dark blue line shows the average F1 score across all classes. Examining this curve, we see that the model consistently performed with an F1 score of 97%. Reaching this result on a dataset that is larger and more representative of the real world shows great improvement over the 96% accuracy of our previous work [15].
Figure 10. An image of players alongside their locations transformed to bird’s-eye view and projected to a virtual field. Notice the highly accurate positioning of each player’s processed location. These quality results are the product of our enhanced approaches to both player and number recognition.
Figure 11. A frame that corresponds to the exact middle of the predicted snap portion of a sequence in the test set. As intended, the offense team is shown to be in formation and ready to imminently snap the ball.
Table 1. The Levenshtein similarity index (LSI) of event detection predictions made by different model architectures. All three methods were trained and tested on the same data, allowing direct comparison and highlighting the clear superiority of our proposed bidirectional LSTM.

Architecture | LSI
Gated recurrent unit (GRU) | 0.9330
Transformer encoder | 0.8462
Proposed method (LSTM) | 0.9445