1. Introduction
The identification of performance factors, understood as variables that define some aspect of performance and that help achieve sporting success [1], is essential for identifying the behavior patterns most likely to lead to success [2] and for enabling the improvement and prediction of performance [3,4]. The analysis of matches identifies the variables related to success [5], and grouping and combining these success indicators of different kinds allows the construction of football performance profiles [4,6]. To obtain both the indicators and the performance profiles, discriminant analysis of the game between teams of different levels is a very useful tool. However, football is a sport of a complex and dynamic nature, which makes the identification of these performance profiles a very difficult task [7], because success in the game can be associated with multiple factors (physical, technical, tactical, …), some of which are unpredictable or uncontrollable, such as refereeing decisions, individual successes or failures of players, match location, type of competition or even chance.
Football research has turned to a multitude of performance indicators [8], and some studies have tried to identify them through the comparative analysis of successful and unsuccessful teams [9,10,11,12,13,14,15,16,17]. Some of these works show conflicting results. This may be caused, among other things, by the type and size of the sample, the study design, the selection of the variables and the characteristics of the sport itself. It may also be because most studies define the success of the teams by the match outcome [9,16,18,19,20,21,22,23]. This discrimination criterion can produce erroneous results because, in some matches, the team with the best statistics does not win, since no single winning play style exists in football: several teams with different play styles can obtain similar results. Therefore, teams should be classified by their position at the end of the season rather than by the match outcome.
To study key performance indicators (KPI) and performance profiles in football, a nomothetic analysis is needed rather than an idiographic one, as the latter would identify the behavior patterns of a single team and not of the game. It is therefore necessary to conduct a longitudinal analysis of all the teams and matches of one or several regular seasons and to classify the teams according to their final position rather than the match outcome. In this way, the KPI will be more reliable because they will be less affected by the factors indicated above, and the higher performance (higher score) of the best-placed teams at the end of the season can be attributed to their having maintained more effective behavior. Nevertheless, there are few previous studies along these lines [11,24,25,26,27,28].
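The season-level classification argued for above can be sketched in a few lines. This is only an illustration under stated assumptions, not the procedure used in the study: the match tuples, the team names and the group size `k` are hypothetical, and the only rule assumed is the usual 3-1-0 points system.

```python
from collections import defaultdict

def final_table(matches):
    """Rank teams by points from (home, away, home_goals, away_goals) tuples."""
    points = defaultdict(int)
    for home, away, home_goals, away_goals in matches:
        if home_goals > away_goals:
            points[home] += 3
            points[away] += 0  # registers the team even with zero points
        elif home_goals < away_goals:
            points[away] += 3
            points[home] += 0
        else:
            points[home] += 1
            points[away] += 1
    # Points descending; the alphabetical tie-break is for determinism only
    # and is NOT the official league tie-break rule.
    return sorted(points.items(), key=lambda item: (-item[1], item[0]))

def label_groups(table, k):
    """Label the first k teams 'best' and the last k 'bottom' (k is hypothetical)."""
    teams = [team for team, _ in table]
    labels = {team: "best" for team in teams[:k]}
    labels.update({team: "bottom" for team in teams[-k:]})
    return labels

# Hypothetical mini-season with four teams (illustrative data only).
matches = [
    ("A", "B", 2, 0), ("C", "D", 1, 1), ("A", "C", 0, 1),
    ("B", "D", 0, 0), ("A", "D", 3, 1), ("B", "C", 1, 2),
]
table = final_table(matches)
groups = label_groups(table, k=1)
```

Labelling teams once per season, rather than once per match, means that a single unlucky result does not move a team from one group to the other.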
Sometimes, in this type of work, especially when indirect observation methodology is used, we obtain a very extensive data matrix with many related variables. In this case, it is beneficial to reduce the matrix to simplify interpretation and eliminate redundant information. However, if the reduction is carried out under subjective criteria, there is a risk of losing relevant information. Therefore, we need a tool that allows us to objectively reduce the dimensions of a data matrix without losing important information. Principal component analysis (PCA) can be an adequate statistical technique for this, since its aims are to simplify, reduce and structure the initial information obtained [29]. Its application to the tactical analysis of football has been demonstrated in various works with satisfactory results. Specifically, Gómez et al. [30] studied the independent and interactive effects of game location and final result on game-related statistics, according to the area of the field in which they occurred, in LaLiga from the 2003/2004 to the 2007/2008 seasons. They identified different team profiles related to the match venue and the match outcome. Moura et al. [31] identified two principal components in the 2006 World Cup and showed that shots, shots on goal and percentage of ball possession are among the variables that discriminate between winning, drawing and losing teams. Winter and Pfeiffer [23] identified four dimensions in UEFA Euro 2012 (game speed, transition play after ball recovery, transition play after ball loss and offense efficiency) and concluded that transition play after losing the ball and offense efficiency seem to be directly connected with the match outcome, as those were important values for a successful discrimination. The authors of [32] investigated which factors were most crucial for the match outcome in Serie A and concluded that shots on target is the key performance indicator of the game. Ric et al. [33] compared the individual and collective spatial organization of the players between the first and second halves of the game. Fernández-Crehuet et al. [34] built an index to measure the performance of Spanish football league teams during the 2016/2017 season, combining five dimensions: economic, fan-related, historical, team quality and the season’s data. The authors of [35] identified and differentiated various styles of play among the teams of the Chinese Soccer Super League during the 2006 season: possession, set-piece attack, counterattacking play and, finally, transitional play.
However, we have not found previous works that have applied PCA to the tactical analysis of LaLiga teams over several seasons and that have determined the level of performance from the position the teams occupied in the league table at the end of the season. Nor have previous works identified and used components to develop a performance model for teams of different levels. Consequently, we carried out this study with two aims. The first was to reduce the size of a large database and group it into new categories without losing information, by means of PCA. The second was to perform a comparative and predictive performance analysis of the best and bottom teams of LaLiga, using the KPI of each group.
4. Discussion
To identify the indicators that influence football performance, we performed a comparative analysis between teams of different levels of success. Such analyses sometimes yield a data set with many related categories, so techniques that reduce the quantity of data can be useful. In this work we reduced the dimensions of a data matrix without the loss of relevant information, using PCA. Subsequently, we used the resulting principal components (PCs) to try to identify the differences in performance between the best and bottom teams of LaLiga.
The PCA data mining technique allowed us to reduce the dimensions of a broad set of original categories without losing information, creating new categories for both groups. Specifically, we reduced the original data matrix, composed of 57 categories, to fewer than 10 new categories that explain ≥70% of the variance, enabling the grouping of information and the simplification of the analysis. For the best teams group, PCA created 8 PCs that explained 70.1% of the variance (Table 1): Passes (27%); Challenges (15%); Attack effectiveness (8%); Shots (7%); Dribbles (4%); Tackles (4%); Offensive transitions (3%) and Possession (3%). For the bottom teams group, 9 PCs were created that explained 70% of the variance (Table 3): Passes (23%); Challenges (14%); Shots (8%); Attack effectiveness (6%); 1vs1 (5%); Dribbles (4%); Fouls opponent (4%); Possession (3%) and Offensive transitions (3%).
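The rule of retaining components until a share of the variance is reached can be sketched as follows. This is a minimal illustration rather than the authors' procedure: it takes the rounded per-component shares listed above, read as percentages of total variance, and counts how many leading components are needed to reach a 70% threshold. The function name is hypothetical, and because the inputs are rounded, the cumulative totals differ slightly from the exact figures in the tables (for example, 71% instead of 70.1% for the best teams).

```python
def components_needed(shares_pct, threshold_pct=70):
    """Count the leading components needed for the cumulative explained
    variance (in %) to reach threshold_pct; return None if never reached."""
    cumulative = 0
    for count, share in enumerate(shares_pct, start=1):
        cumulative += share
        if cumulative >= threshold_pct:
            return count
    return None

# Rounded shares (in %) as listed in the text, largest component first.
best_teams = [27, 15, 8, 7, 4, 4, 3, 3]
bottom_teams = [23, 14, 8, 6, 5, 4, 4, 3, 3]

n_best = components_needed(best_teams)
n_bottom = components_needed(bottom_teams)
```

Integer percentages are used on purpose, so that the comparison with the threshold is not affected by floating-point rounding.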
In both groups, the Passes PC is so named because most of the categories that constitute it refer to the number of passes and the time of possession. The Challenges PC received this name because it included all types of challenges. The Attack effectiveness PC collected categories of the offensive phase, especially those related to goals, shots and the effectiveness of shots. The Shots PC mainly included categories related to goals, shots, possession and passes. The Dribbles PC was mainly constituted by categories referring to dribbling, tackles and challenges. The Tackles PC was related to dribbling, challenges, tackles and lost balls. The Offensive transitions PC received this name for being related to recoveries, interceptions and counterattacks. The Possession PC, in the group of successful teams, was the least well defined, since it is made up of categories with little relation between them. In the bottom teams, the 1vs1 PC included all dribbles and tackles. The Fouls opponent PC is made up of varied categories, the most heavily weighted being fouls opponent, and, finally, the Possession PC is also formed by different categories, the time of possession being the most important.
Therefore, as in some previous works [23,31,33,34,35,45,46,47], PCA proved to be a good statistical technique for reducing large data sets with many interrelated variables, allowing us to speak not only of individual performance indicators but of sets of related indicators.
If we use the PCs to compare the game of both groups, the first difference we observe is that, to explain the same percentage of variance, eight PCs are needed for the best teams group and nine for the bottom teams. In both groups, the Passes and Challenges PCs explained the highest percentage of the variance. The Passes PC had a slightly greater weight in the best teams group (27%) than in the bottom teams (23%) (Table 2 and Table 4). Challenges, on the other hand, showed a similar weight in both groups (15% and 14%). However, the loadings of each PC were not exactly the same for each group (Table 3 and Table 5). Thus, for the Passes PC in the best teams group, the most important categories were passes, passes accurate, passes accurate left and passes accurate right. For the bottom teams, the categories with the highest weight were possession, passes, passes forward, passes left, passes right and passes forward accurate. Therefore, we can indicate that successful teams are characterized more by the efficiency of their passes than by the number of passes executed. That is, they complete a greater number of successful passes than lower level teams. These results coincide with a previous work [22], although that work analyzed the 2014 FIFA World Cup in Brazil and used logistic regression. We also found some differences for the Challenges PC. For the best teams, attacking challenges had greater weight, whereas defensive challenges were the most relevant for bottom teams. This can be explained by the fact that bottom teams spend longer in the defensive phase, executing many more defensive than offensive actions. Previous work also indicates that successful teams show higher averages of offensive variables and unsuccessful teams show higher averages of defensive variables [48].
Another difference in terms of PC formation is that in the best teams a PC called Tackles is formed, consisting mainly of the categories dribbling, challenges, tackles and lost balls. In the bottom teams, the 1vs1 and Fouls opponent PCs were formed, which did not appear in the other group. In spite of these differences, both the components formed for each group and the categories and their weights within each component were very similar. This leads us to think that in high level football the differences between teams are minimal, and their success or failure may be explained by the individual performance of their players.
The results of the linear regression model (Table 5 and Table 7) allow us to identify which PCs have the greatest influence on the performance of both groups of teams. For this, a prediction model of the category “EFFECTIVENESS” was built for both the best and the bottom teams. For the best teams group, the PCs, ordered from highest to lowest coefficient, were: Attack effectiveness (0.76260); Offensive transitions (0.40160); Shots (0.36481); Possession (0.33451); Dribbles (0.22498); Passes (0.03851); Challenges (−0.23416) and Tackles (−0.40160). For the bottom teams the order was: Dribbles (0.55955); Possession (0.51367); Shots (0.30873); Passes (0.17582); Challenges (0.07051); 1vs1 (−0.18340); Fouls (−0.54061) and Attack effectiveness (−0.84295). We can see that in the best teams, the PC with the greatest influence on the prediction of this category was Attack effectiveness. More goals, more ball possession time in the final third of the field, and a greater number of effective shots and crosses increase the performance of the best teams. This information is essential for coaches since, if they manage to improve the performance of their teams in these elements of the game, they will increase their offensive performance. The information provided by the number of goals is trivial, since scoring more goals obviously implies greater offensive performance, but the other indicators, referring to ball possession zone, effective shots and crosses, do offer meaningful information. These results are corroborated by [9,10], who indicated that successful teams have longer possessions in the offensive half of the field than in the defensive one. The works [19,22,49,50] indicate that successful teams show greater effectiveness in shooting and also confirm that making a greater number of crosses increases the chances of winning matches. In contrast to the cited studies, we obtained similar results using a different method, specifically a data mining technique. Winter and Pfeiffer [23] also reached the same conclusion, indicating that there is a relationship between offense efficiency and success, but they analyzed UEFA Euro 2012 and defined success as the match outcome.
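To make the reading of the regression coefficients concrete, the sketch below ranks the PCs of each group by the absolute size of the coefficients quoted above and evaluates the linear predictor for one set of PC scores. It is an illustration only: the function names and the example scores are hypothetical, the intercept is not reported in the text and is set to a placeholder of 0.0, and comparing coefficients by magnitude assumes that the PC scores are on a comparable scale.

```python
# Coefficients as quoted in the text for each group.
best_coefs = {
    "Attack effectiveness": 0.76260, "Offensive transitions": 0.40160,
    "Shots": 0.36481, "Possession": 0.33451, "Dribbles": 0.22498,
    "Passes": 0.03851, "Challenges": -0.23416, "Tackles": -0.40160,
}
bottom_coefs = {
    "Dribbles": 0.55955, "Possession": 0.51367, "Shots": 0.30873,
    "Passes": 0.17582, "Challenges": 0.07051, "1vs1": -0.18340,
    "Fouls": -0.54061, "Attack effectiveness": -0.84295,
}

def rank_by_influence(coefs):
    """Order PCs by absolute coefficient (strength of influence, either sign)."""
    return sorted(coefs, key=lambda name: abs(coefs[name]), reverse=True)

def predict(coefs, scores, intercept=0.0):
    """Linear predictor: intercept + sum of coefficient * PC score.
    The intercept is a placeholder; PCs missing from scores count as 0."""
    return intercept + sum(coefs[name] * scores.get(name, 0.0) for name in coefs)

# Hypothetical match: one unit above average on two PCs, average on the rest.
example_scores = {"Attack effectiveness": 1.0, "Tackles": 1.0}
predicted_best = predict(best_coefs, example_scores)

best_ranking = rank_by_influence(best_coefs)
bottom_ranking = rank_by_influence(bottom_coefs)
```

Ranking by absolute value shows that Attack effectiveness carries the largest coefficient in both groups, with opposite signs, which the ordering from highest to lowest signed value hides for the bottom teams.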
Following the results of the linear regression model, the main differences between the two groups in the prediction of performance occur in the PCs Offensive transitions, Tackles, Challenges, Dribbles, Fouls opponent and 1vs1. Offensive transitions play a more important role in the best teams than in the bottom teams. Thus, in the best teams, a greater number of recoveries, interceptions and counterattacks, that is, dynamic offensive transitions through counterattacks, would increase performance in the game. This was also pointed out by Tenga et al. [40], who analyzed the Norwegian league and, by means of multiple linear regression, found that the proportion of goals scored during counterattacks (52%) was higher than during elaborate attacks (48%). Therefore, the offensive game seems to be more efficient against a disorganized defense. This information is very important for coaches, who should focus their training on these game situations, both in attack and in defense, to try to improve performance in both phases of the game.
In the best teams, the Tackles and Challenges PCs negatively influenced offensive performance. This may be because these are behaviors more typical of unsuccessful teams, as indicated above [48].
In bottom teams, increasing the number of successful dribbles would increase performance. This result coincides with that of Harrop and Nevill [21], who found that the number of dribbles is correlated with performance. The Fouls opponent PC also showed a strong negative influence on the performance of bottom teams, and these teams showed fewer effective attacks than the best teams.
We have achieved the aims set, and the sample used, comprising the matches of three competitive seasons, allows us to generalize the results. The main contribution and novelty of this work is that we have carried out a longitudinal tactical analysis of LaLiga teams, using the combination of factor analysis and linear regression. However, we believe that the differences found in the constitution of the different PCs have not been as satisfactory as we would have liked. This may be due to the design used: we found the PCs for each group of teams separately and subsequently tried to build a probabilistic model with the detected PCs. In future works we should propose a design in which we find the principal components for both groups together and then build a separate model for each group. In addition, since the goals scored and received did not contribute significantly to the principal components, removing these variables from the analysis could be considered in the future, because including them may be biasing it.
The results of this work offer coaches information about the KPIs in football and the game pattern of the best teams, which they can compare with that of their own teams in order to make the appropriate modifications and increase performance.