Then, we thoroughly describe the comprehensive methods employed for constructing features, including the use of heatmaps for extracting key features and various feature engineering techniques to create new variables. We also offer an in-depth explanation of the newly introduced features, discussing their relevance and significance in the context of our model. Additionally, this section includes a summary of the features, highlighting their importance, and examines the correlations among them to enhance the model’s predictive performance. Moreover, we outline the data preprocessing steps and normalization techniques used to prepare the dataset for modeling. These insights help inform the decisions made in the subsequent stages of our analysis and modeling.
3.1. Data Sources and Tools
Data serves as the foundation for any machine learning project, playing a critical role in the accuracy and effectiveness of analysis and predictions. The success of research outcomes is highly dependent on both the quantity and quality of the data used. In this study, we utilized a dataset from Understat, a well-known professional soccer website, with data gathered by analyst Edd Webster, who has publicly shared it on his GitHub repository [
28,
29]. Our dataset includes match events from the Top 5 European leagues and the Russian Premier League, covering the 2014/2015 to 2021/2022 seasons. It comprises 21,678 observations, offering a detailed perspective on player and team performance over multiple years. As a result of this time frame, individual players can appear multiple times, leading to a total of 6359 unique players in the dataset. Our dataset consists of 29 features related to individual players, reflecting their performances in matches, constructed using three distinct methods. Initially, we extracted 14 core features from Understat, which included well-established soccer metrics such as goals, assists, and expected goals (
xG). Moreover, we developed another 13 features through feature engineering. These metrics are fundamental in evaluating player performance and are widely recognized in soccer analytics [
10,
23,
29]. In addition to the 27 features derived from extraction and engineering, we introduced two new features called
position_weight and
league_weight to further refine model accuracy. These new additions were carefully selected to offer additional insights and enhance the overall predictive power of our models. A detailed explanation of the feature construction process is provided in the following sections.
Furthermore, for this research, we primarily use Python (version 3.11.7) due to its user-friendly nature and extensive range of libraries. We leverage several Python libraries, including scikit-learn (version 1.4.2), which provides a wide array of pre-built algorithms and tools for data preprocessing, model training, and evaluation [
30]. Additionally, we use Pandas (version 2.1.4) for efficient data manipulation and analysis [
31], while NumPy (version 1.26.4) serves as the foundation for numerical computing, supporting multi-dimensional arrays, matrices, and a variety of mathematical functions [
32]. For visualization, we utilize Seaborn (version 0.12.2), an advanced visualization library built on top of Matplotlib, which simplifies the creation of informative and aesthetically pleasing statistical visuals such as heatmaps and box plots [
33]. Matplotlib (version 3.7.5) itself offers a diverse range of chart and plot options, enabling detailed customization and seamless integration with other libraries for effective data visualization [
34]. By combining these tools and libraries, we efficiently process, analyze, and visualize data, thereby enhancing the overall quality and depth of our research findings.
3.2. Building Features for ML Models
In this subsection, we present three distinct methods for constructing features to build our machine learning models. To facilitate a clearer understanding of these methods, we provide accompanying tables and a graph that illustrate our approaches and highlight the significance of the features constructed through each method.
3.2.1. Heatmap-Based Feature Extraction
To extract features from Understat [
29], we use a systematic approach that includes leveraging heatmaps to identify additional correlated features. The heatmap shown in
Figure 2 visually represents the correlations between the various features selected from our dataset. This visualization helps in understanding the strength and direction of relationships among features, with color intensity indicating the degree of correlation. We set a predefined correlation threshold, typically ±0.5, to guide our selection process. Features with coefficients greater than 0.5 were considered to have a significant positive correlation, while those less than −0.5 indicated a significant negative correlation. By applying these criteria, we retain only features with a strong relationship to the two target variables,
xGg and
aGg, thereby enhancing our model’s predictive power. The selected features include fundamental metrics such as
games and
goals which are crucial for assessing player productivity and scoring efficiency. Moreover,
xG and
assists were included for their roles in quantifying scoring opportunities created and converted. Metrics like
xGChain and
xGBuildup were also chosen to measure a player’s involvement in goal-scoring sequences and play-building activities. Furthermore, the player ID serves as a unique identifier within the dataset. Generally, it is not regarded as an informative feature since it does not provide statistical insights into performance or characteristics.
These features were carefully selected based on their logical relevance and statistical significance, aiming to improve both the accuracy and robustness of our predictions. This approach facilitates a deeper understanding of player performance across various soccer leagues, reflecting a nuanced interpretation of the underlying data dynamics.
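As a minimal sketch of the threshold-based selection described above, the following Python snippet keeps the columns whose absolute Pearson correlation with at least one target exceeds 0.5. The function name `select_correlated_features` and the toy values are assumptions made for illustration; they are not taken from the dataset or from our pipeline.

```python
import pandas as pd


def select_correlated_features(df, targets, threshold=0.5):
    """Return columns whose |Pearson r| with any target exceeds the threshold."""
    corr = df.corr(numeric_only=True)                 # the matrix a heatmap would display
    candidates = corr.drop(index=targets)[targets]    # rows: features, columns: targets
    mask = (candidates.abs() > threshold).any(axis=1)
    return sorted(candidates.index[mask])


# Made-up sample rows; the numbers are illustrative only.
toy = pd.DataFrame({
    "goals": [0, 2, 5, 9, 12],
    "xG":    [0.4, 2.1, 4.8, 8.5, 11.0],
    "noise": [3, 1, 3, 1, 3],
    "xGg":   [0.02, 0.10, 0.25, 0.40, 0.55],
    "aGg":   [0.00, 0.09, 0.26, 0.43, 0.60],
})
selected = select_correlated_features(toy, targets=["xGg", "aGg"])
print(selected)
```

The same correlation matrix can be passed to `seaborn.heatmap` to obtain a visual counterpart of the selection.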
3.2.2. Feature Engineering
In addition to feature extraction, we implemented feature engineering to introduce 13 additional features categorized by their relevance to player performance: scoring efficiency, playmaking abilities, disciplinary behavior, and advanced metrics. Feature engineering enhances the dataset by creating new features that provide deeper insights into player performance. This process involves transforming raw data into meaningful inputs that improve model performance, predictive accuracy, and the overall understanding of player dynamics. For instance, metrics like aGg (actual goals per game) help assess a player’s scoring efficiency by indicating the average number of goals scored per game, offering insights into offensive productivity and identifying top-performing players. Another engineered feature, gpm (goals per minute), reveals the rate at which goals are scored during game time. This aids in assessing a player’s impact throughout the match and enhances the precision of goal-scoring predictions. Moreover, apg (assists per game) sheds light on a player’s playmaking abilities by showing the average number of assists provided per game. This metric not only highlights individual performance but also contributes to understanding collaborative efforts within the team, thereby enriching insights into team dynamics. In terms of disciplinary behavior, features like ypg (yellow cards per game) and rpg (red cards per game) provide insights into a player’s disciplinary record and aggression level. Monitoring these metrics helps assess their impact on match outcomes and team dynamics, considering potential suspensions or player availability issues. Advanced metrics like xGdiff (expected goals difference) further enrich the analysis by evaluating a player’s goal-scoring potential relative to the quality of scoring opportunities.
Additionally,
Table 2 offers a detailed summary of all features, including those obtained through different methods discussed in the subsequent subsections. This table presents both descriptions of the features and their correlation coefficients with the target variables, as detailed in the analysis.
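The per-game and per-minute features described in this subsection can be sketched as simple ratios. The snippet below is an illustration under stated assumptions: the column names `time`, `yellow_cards`, and `red_cards` are assumed, and xGdiff is assumed to be goals minus xG, since the sign convention is not spelled out in the text.

```python
import numpy as np
import pandas as pd


def add_rate_features(df):
    """Add ratio-based features; zero denominators become NaN instead of inf."""
    out = df.copy()
    games = out["games"].replace(0, np.nan)
    minutes = out["time"].replace(0, np.nan)       # assumed column name for minutes played
    out["aGg"] = out["goals"] / games              # actual goals per game
    out["gpm"] = out["goals"] / minutes            # goals per minute
    out["apg"] = out["assists"] / games            # assists per game
    out["ypg"] = out["yellow_cards"] / games       # yellow cards per game
    out["rpg"] = out["red_cards"] / games          # red cards per game
    out["xGdiff"] = out["goals"] - out["xG"]       # assumed sign: actual minus expected
    return out


# Made-up rows; the second player has no appearances.
toy = pd.DataFrame({
    "games": [10, 0], "time": [900, 0], "goals": [5, 0], "assists": [2, 0],
    "yellow_cards": [1, 0], "red_cards": [0, 0], "xG": [4.0, 0.0],
})
engineered = add_rate_features(toy)
print(engineered[["aGg", "gpm", "apg", "xGdiff"]])
```

Replacing zero denominators with NaN is a design choice of this sketch; it avoids infinite values for players with no recorded games or minutes.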
3.2.3. Introduction of New Features
To complement the features obtained through heatmaps and feature engineering, we introduce new features, such as
league_weight, to further improve the performance of our model. This feature enhances our dataset by capturing the variability in player performance across leagues, allowing the model to make informed predictions within a broader competitive context. Moreover, the
league_weight feature is used to differentiate between leagues, which is essential for avoiding bias. It is important to account for varying levels of competition, quality, and popularity across leagues. Each league has a unique history of success in international tournaments like the UEFA Champions League, influencing its competitive standard [
35].
Furthermore, league quality varies with the caliber of players and teams, so league-specific adjustments are needed for more accurate insights [
35]. Additionally, fan engagement and attendance rates can vary widely from league to league [
36]. By applying different ratios to each league, we can adjust for these discrepancies, accurately reflecting each league’s actual strength and unique characteristics. This differentiation enables fairer comparisons between players and teams, allowing our models to accommodate variations in league quality and other factors, ultimately leading to more precise and reliable soccer analysis.
The formula for calculating
league_weight is as follows:
$$\text{league\_weight}_i = w_h \cdot \text{hist\_score}_i + w_f \cdot \text{fan\_score}_i$$
where the variable
i stands for each league. The weight
$w_h$ is assigned to historical information, while
$w_f$ is assigned to fan attendance. Moreover, hist_score represents the normalized historical score for each league, and fan_score is the normalized fan attendance score for each league.
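A minimal sketch of this computation follows, assuming that league_weight is a weighted sum of the two normalized scores. The default weights of 0.5, the 0 to 10 score scale, and the league entries are hypothetical placeholders chosen for illustration; they are not the values used in our study.

```python
def league_weight(hist_score, fan_score, w_hist=0.5, w_fan=0.5):
    """Weighted sum of two normalized league scores (assumed functional form)."""
    return w_hist * hist_score + w_fan * fan_score


# Hypothetical normalized scores on a 0-10 scale: (hist_score, fan_score).
leagues = {
    "League A": (10.0, 10.0),
    "League B": (6.0, 4.0),
}
weights = {name: league_weight(h, f) for name, (h, f) in leagues.items()}
print(weights)
```

With weights that sum to 1 and scores capped at 10, the resulting league_weight cannot exceed 10, which is consistent with a maximum of 10 points per league.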
According to our formula,
league_weight is calculated using two main factors: historical success and fan attendance. By including these two factors, we make sure the analysis considers both the competitive strength and the popularity of each league. A higher historical score shows strong international performance and a higher standard of competition, while a higher fan attendance score reflects more commercial appeal, financial resources, and the ability to attract top players. Each league’s score can go up to a maximum of 10 points, with a perfect score indicating the highest levels of fan support and historical success. Using these weights helps us account for differences between leagues, allowing a more accurate representation of each league’s true strength and unique features. This approach ensures fairer comparisons between players and teams from different leagues by considering differences in league quality and other key factors. Overall, this method provides more reliable and precise soccer analysis, ensuring our models give valid insights across various levels of competition. In addition, we introduce another feature called
position_weight. Player position significantly affects a player’s
xG and
aG as well as the overall impact on the game. Forwards, who focus on scoring goals, usually have higher xG compared to defenders. Meanwhile, goalkeepers and defenders have specific duties; goalkeepers concentrate on making saves, while defenders aim to stop the opposition from scoring [
37].
Our dataset includes players whose primary positions are goalkeeper, defender, midfielder, and striker. Additionally, over the years, certain players have transitioned between positions on the field. To capture these shifts and their impact, we developed a feature that categorizes players into specific positional groups. We then assign a weight to each group based on the players’ anticipated contributions to scoring goals and influencing game outcomes. This approach allows us to account for positional versatility and to better understand how players’ roles evolve and affect their overall performance and team dynamics. For example, we grouped players into categories such as “midfielder and striker”; “defensive striker”; and “forward, midfielder, and striker”, among others. Each group received a weight based on its influence on xG. Positions related to scoring goals, such as forward and striker, received a higher weight of 10, highlighting their key role in creating scoring chances. In contrast, positions with defensive or goalkeeping duties, such as defender and goalkeeper, received lower weights of 4 and 0.1, respectively, reflecting their focus on preventing goals rather than scoring. Weights for mixed roles, such as “defender, forward, and midfielder” or “forward and midfielder”, were adjusted to account for their diverse responsibilities across different aspects of the game. By incorporating position_weight, we adjust our analysis to account for the various responsibilities and expected outcomes of different player positions. This feature offers a more comprehensive understanding of player performance and potential across different roles on the field. It also enables more precise comparisons between players in similar positions, leading to deeper and more meaningful insights into the game.
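The mapping from positional groups to weights can be sketched as a lookup. In the snippet below, the weights for forward, striker, defender, and goalkeeper follow the values stated above, whereas the midfielder weight and the rule of averaging the component weights for mixed roles are assumptions made only for illustration.

```python
BASE_WEIGHTS = {
    "forward": 10.0,      # stated in the text
    "striker": 10.0,      # stated in the text
    "defender": 4.0,      # stated in the text
    "goalkeeper": 0.1,    # stated in the text
    "midfielder": 7.0,    # hypothetical placeholder, not given in the text
}


def position_weight(positions):
    """Weight for a comma-separated positional group.

    Mixed roles are averaged over their component roles; this averaging rule
    is an assumption of the sketch, not the adjustment used in the study.
    """
    roles = [p.strip().lower() for p in positions.split(",")]
    return sum(BASE_WEIGHTS[r] for r in roles) / len(roles)


print(position_weight("Forward"))
print(position_weight("Defender, Forward, Midfielder"))
```

A lookup table keeps the weights easy to audit and to replace once the actual values for mixed groups are available.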
3.3. Feature Summary and Correlations
In summary, this section presents a comprehensive overview of the features included in our dataset, which were collected using the three methods discussed above.
Table 2 summarizes all features included in the analysis, giving a description of each feature and its correlation coefficient. The reported coefficients are averaged over the two target variables,
xGg and
aGg.
The table highlights the various features and their correlations, demonstrating their influence on the model’s predictions. Features 1 to 14 were collected using a heatmap, with their respective correlation values directly related to this method. For instance, the goals feature shows a strong correlation of 0.85, indicating its significant impact on evaluating player performance. Similarly, xG has a high correlation of 0.91, underscoring its essential role in outcome prediction. Another notable feature from this group is shots, with a correlation of 0.80. The remaining features, from 15 to 27, were developed through feature engineering, and their correlation values with the target variables are also included in the table. For example, gpm (goals per minute) and shpm (shots per minute) have correlations of 0.54 and 0.53, respectively, offering valuable context by capturing different aspects of gameplay. Additionally, league_weight and position_weight exhibit moderate correlations of 0.55 each. While these features may have lower individual importance, they enhance the model by providing supplementary insights that, when combined with other metrics, improve overall predictive accuracy. The table also categorizes the model’s variables into “input” features and “input and target” features. Input features reflect a player’s past performance, capturing historical data as foundational information. Input and target features, however, combine historical data with predictive relevance, directly contributing to the model’s target outcomes. Metrics like aGg and xGg not only summarize past achievements but also enhance predictions, making them essential for performance evaluation and forecasting.
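One way to obtain a single coefficient per feature is to average its correlations with the two targets. The sketch below assumes that this is how the averaged values are formed; the helper name `mean_target_correlation` and the toy data are hypothetical.

```python
import pandas as pd


def mean_target_correlation(df, targets):
    """Average each feature's Pearson correlation over the given target columns."""
    corr = df.corr(numeric_only=True)
    features = [c for c in corr.columns if c not in targets]
    return corr.loc[features, targets].mean(axis=1).round(2)


# Made-up, perfectly linear data so the expected correlations are obvious.
toy = pd.DataFrame({
    "goals": [1, 2, 3, 4],
    "cards": [4, 3, 2, 1],
    "xGg":   [0.1, 0.2, 0.3, 0.4],
    "aGg":   [0.2, 0.4, 0.6, 0.8],
})
avg_corr = mean_target_correlation(toy, targets=["xGg", "aGg"])
print(avg_corr)
```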
Moreover,
Table 3 illustrates the feature importance for both
xGg and
aGg. This table provides a detailed comparison of how different features contribute to the predictions of
xGg (expected goals per game) and
aGg (actual goals per game). To determine feature importance, we employed SHAP (SHapley Additive exPlanations) values, which provide an in-depth understanding of how different features influence the model’s predictions. In the context of
xGg, the most influential features include
aGg (actual goals per game) and
gpm (goals per minute), which contribute 33.82% and 33.31% to the model, respectively. Other significant features are
shpg (shots per game) with 6.43%, and
apg (assists per game) with 4.91%. These features indicate that scoring efficiency and shooting metrics are critical for predicting expected goals per game. Conversely, for
aGg, the most impactful feature is
xGg (expected goals per game), which has a dominant importance of 60.61%. This suggests that the expected goals metric is highly predictive of the actual goals scored. Other important features for
aGg include
gpm (goals per minute) with 10.90%, and
shpm (shots per minute) with 10.46%. These contributions highlight the relevance of shooting metrics and goals efficiency in predicting actual goals per game. While some features, such as
xGChain and
xGBuildup, show relatively low importance percentages (below 1% for
xGg and below 0.5% for
aGg), they still play a role in the overall model. For instance, features like
xGChain and
xGBuildup capture aspects of play that, despite their smaller individual impact, can affect the prediction accuracy when combined with other features.
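Percentage contributions of this kind can be derived from a matrix of SHAP values. The sketch below assumes that each percentage is a feature's mean absolute SHAP value divided by the sum over all features; this normalization is an assumption, and the matrix is made up for illustration. In practice the matrix would come from a SHAP explainer, for example `shap.TreeExplainer(model).shap_values(X)` for a tree-based model.

```python
import numpy as np


def shap_importance_percent(shap_values, feature_names):
    """Convert a (samples x features) SHAP matrix into percentage shares."""
    mean_abs = np.abs(np.asarray(shap_values, dtype=float)).mean(axis=0)
    share = 100.0 * mean_abs / mean_abs.sum()
    order = np.argsort(share)[::-1]               # most important feature first
    return {feature_names[i]: round(float(share[i]), 2) for i in order}


# Made-up SHAP values for two samples and three features.
toy_shap = [[0.6, -0.3, 0.1],
            [-0.6, 0.3, -0.1]]
importance = shap_importance_percent(toy_shap, ["xGg", "gpm", "shpm"])
print(importance)
```

Taking absolute values before averaging prevents positive and negative contributions of the same feature from cancelling out.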
Table 4 presents a comprehensive overview of player performance metrics, offering valuable insights into the intricate dynamics of soccer. The dataset reveals that players have participated in an average of 3214 games, indicative of their substantial careers and commitment to the sport. This longevity is a testament to the players’ resilience and adaptability, essential qualities in the highly competitive environment of professional soccer.
In terms of on-field activity, players typically contribute around 18.76 min of impactful play per match. This metric underscores the significance of efficiency, as athletes must maximize their performance within the constraints of limited playing time. Notably, the dataset records an average of 1317 goals scored, which encapsulates the myriad moments of skill and achievement that characterize these athletes’ careers. Complementing this, the expected goals (xG) statistic, averaging 1.75, indicates players’ proficiency in finding scoring opportunities, highlighting their ability to create potential goal-scoring situations.
Teamwork is also crucial in soccer, as evidenced by the average of 1.80 assists per player. This statistic reflects the collaborative nature of the sport, showcasing players’ capabilities to facilitate scoring opportunities for their teammates. Furthermore, an impressive average of 16.58 key passes per player illustrates the role of playmakers who significantly influence the game’s outcome through strategic ball distribution and offensive orchestration. However, the competitive nature of soccer presents its challenges. The data reveal an average of 12.24 yellow cards per player, suggesting the intense nature of matches where discipline can be as vital as scoring. The variability observed in metrics, including shots per game (0.73) and expected goals per game (0.08), encapsulates soccer’s unpredictable essence, where individual talents often shine through the complexities of game scenarios.
Moreover,
Table 5 shows a sample from the dataset. The table summarizes compiled data without additional calculations.
It includes various performance metrics for soccer players, such as a unique identifier for each player, the number of games played, total minutes on the field, goals scored, and assists provided. The dataset also features advanced statistics like expected goals (xG), which measures the quality of scoring chances, and expected assists (xA), which estimates the likelihood of a pass leading to a goal. Additional metrics include the total number of shots taken, key passes made (those that lead to a shot), yellow cards received, and averages for shots per game (shpg) and expected goals per game (xGg).