Can We Extract Value from Missed Shots in Football?

Using Generative Models to Extract Valuable Insights from Missed Football Shots.

Jan 02, 2025

1 Introduction

Effectively evaluating shooting skill is a key challenge in soccer analytics. Historically, the finishing skill of players has been compared using classical statistics such as goals and shots on target, or more advanced models derived from these statistics. While these metrics are easily obtainable, they suffer from small sample sizes and lack comparability across players since the probability of scoring or landing a shot on target heavily depends on the context of the shot. Consequently, a significant focus in soccer analytics has been the development of advanced shooting metrics that are both stable over time and comparable across players.

One of the most prominent advancements in this domain is the expected goals (xG) model, which addresses comparability by assessing the probability of scoring given specific contextual variables at the time of the shot, such as shot location, proximity of other players, and body part used. Derived metrics like goals above expectation (GAX) aim to quantify shooting skill by subtracting a player's expected goals from their actual goals. However, GAX suffers from limited empirical stability, with season-to-season variations that undermine its reliability.

Post-shot expected goals (PostXg) refine this approach by incorporating the trajectory of a shot after it is struck. Metrics such as expected goals added (EGA), calculated by subtracting pre-shot expected goals (PreXg) from PostXg, improve stability but still suffer from small sample sizes. Furthermore, current shooting metrics use only on-target shots and ignore the trajectories of off-target shots. This oversight is significant because off-target shots make up the majority of attempts in soccer, between 57% and 65% of all shots. As a result, these metrics leave most of the available trajectory information unused.
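To make the two benchmark metrics concrete, here is a minimal sketch. It takes GAX as goals minus summed PreXg and EGA as summed PostXg minus PreXg, so that a positive value means the strike added value. The `Shot` record and the numbers are hypothetical and only illustrate the arithmetic.

```python
from dataclasses import dataclass

@dataclass
class Shot:
    is_goal: bool
    pre_xg: float   # scoring probability before the ball is struck
    post_xg: float  # scoring probability after the strike (0.0 if off target)

def gax(shots):
    """Goals above expectation: actual goals minus summed pre-shot xG."""
    return sum(s.is_goal for s in shots) - sum(s.pre_xg for s in shots)

def ega(shots):
    """Expected goals added: summed (PostXg - PreXg) over all shots."""
    return sum(s.post_xg - s.pre_xg for s in shots)

# Hypothetical player with three shots, one of them off target.
shots = [
    Shot(is_goal=True, pre_xg=0.10, post_xg=0.45),
    Shot(is_goal=False, pre_xg=0.25, post_xg=0.0),   # off target
    Shot(is_goal=False, pre_xg=0.05, post_xg=0.30),
]
player_gax = gax(shots)
player_ega = ega(shots)
```

Note how the off-target shot contributes the same penalty to EGA no matter how narrowly it missed, which is the gap the article goes on to address.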

Drawing parallels from basketball, where including the trajectories of all shots improved estimates of field goal percentage, research suggests a similar potential in soccer. Evidence indicates that incorporating the trajectory information of all shots, including off-target ones, could yield more stable and informative metrics. For instance, near-misses that narrowly miss the upper corners of the goal likely reflect better shooting skill than wild misses.

This paper introduces a hierarchical generative model for soccer shot trajectories, leveraging this model to propose two novel metrics that incorporate the trajectories of all shots, including off-target attempts. These metrics demonstrate improved stability over existing measures and offer predictive value for future player performance, enhancing the evaluation of shooting skill.

2 Data

The analysis draws on data from StatsBomb, comprising 77,315 shots across six international leagues over 15 seasons. These leagues range in ranking from 7 to 31, encompassing high-level professional players but excluding elite-tier competitions. This ensures comparability across leagues while capturing a broad spectrum of professional play. Table 1 provides an overview of the leagues and seasons included in the dataset.

“**Table 1:** Leagues and seasons included in our dataset. Rankings are per Ackerson (2022) on January 23, 2022.”

2.1 Features Utilized

The following features of each shot were used in the analysis:

Player who attempted the shot
Shot outcome (Saved, Goal, Off Target, Blocked, or Post)
Coordinates of the shot start (x, y) and end (x, y, z) locations
Body part used for the shot (Right Foot, Left Foot, Header, or Other)
Estimated PreXg and PostXg for the shot, derived from StatsBomb models

2.2 Data Preprocessing

The dataset underwent several preprocessing steps to ensure accuracy and consistency:

Bias Correction: Adjustments were made to correct a bias in shot end coordinates near the goal frame.
Trajectory Imputation: For saved and blocked shots, trajectories were projected to estimate where they would have crossed the goal line had they not been obstructed. This involved linear projections for horizontal paths and proxy z-coordinate estimations.
Exclusion Criteria: Shots taken from within 6 yards of the goal were excluded to eliminate attempts that primarily rely on force and luck.
Symmetry Exploitation: Left-footed shots were mirrored across the y-axis to leverage the inherent symmetry in soccer data.
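Three of the preprocessing steps above can be sketched as follows. This is a minimal illustration, not the authors' code. It assumes a StatsBomb-style pitch of 120 x 80 yards with the goal line at x = 120, measures the 6-yard cutoff from the centre of the goal, and mirrors across the pitch's long central axis. The function names are hypothetical, and the z-coordinate proxy is not covered.

```python
import math

GOAL_LINE_X = 120.0  # assumption: StatsBomb-style pitch, 120 x 80 yards
PITCH_WIDTH = 80.0

def project_y_to_goal_line(start, end):
    """Linearly extend the horizontal path start -> end to the goal line.

    Returns the projected y-coordinate, or None if the ball was not
    travelling toward the goal line.
    """
    x0, y0 = start
    x1, y1 = end
    if x1 <= x0:
        return None
    t = (GOAL_LINE_X - x0) / (x1 - x0)
    return y0 + t * (y1 - y0)

def mirror_y(y):
    """Reflect a y-coordinate across the pitch's long central axis."""
    return PITCH_WIDTH - y

def within_six_yards(start):
    """True if the shot was taken less than 6 yards from the goal centre."""
    x0, y0 = start
    return math.hypot(GOAL_LINE_X - x0, PITCH_WIDTH / 2 - y0) < 6.0

# A shot from (100, 40) blocked at (110, 42) would have crossed at y = 44.
projected = project_y_to_goal_line((100.0, 40.0), (110.0, 42.0))
mirrored = mirror_y(projected)  # same shot if taken with the left foot
```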

“**Figure 1:** Adjusted shot end coordinates from 2018 MLS data, coloured by outcome. Note that each point may represent multiple shots, as the data are collected on a discrete grid.”

3 Generative Model for Shots

The proposed shooting metrics are underpinned by a generative model designed to capture the spatial dynamics of soccer shot end coordinates. Specifically, the model focuses on the coordinates – denoted as (y, z) – where the ball crosses the plane of the goal line. These coordinates are modeled using a mixture of bivariate Gaussian distributions, truncated to ensure that all z-coordinates remain non-negative. The choice of bivariate Gaussian distributions is supported by empirical studies that demonstrate variability in soccer shots and their utility in modeling spatial dynamics in other sports such as darts and tennis.
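As a rough illustration of what drawing from such a model looks like, the sketch below samples (y, z) end coordinates from a mixture whose components are each truncated to z >= 0. Rejection sampling is used for simplicity. The weights, means, and covariances are made-up values, not fitted parameters.

```python
import numpy as np

def sample_truncated_component(mean, cov, rng):
    """Draw one (y, z) point from a bivariate Gaussian truncated to z >= 0."""
    while True:
        y, z = rng.multivariate_normal(mean, cov)
        if z >= 0.0:
            return y, z

def sample_shots(weights, means, covs, n, rng):
    """Draw n shot end coordinates from the truncated mixture."""
    ks = rng.choice(len(weights), size=n, p=weights)
    return np.array([sample_truncated_component(means[k], covs[k], rng) for k in ks])

# Hypothetical two-component player: low near-post shots and high far-post shots.
rng = np.random.default_rng(0)
weights = np.array([0.7, 0.3])
means = np.array([[37.0, 0.2], [43.0, 2.0]])
covs = np.array([np.eye(2) * 0.5, np.eye(2) * 0.5])
shots = sample_shots(weights, means, covs, n=500, rng=rng)
```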

“**Figure 2:** Comparison of original and projected shot end coordinates for saved shots from 2020 USL data.”

To enhance the model’s ability to generalize across players, a hierarchical component is incorporated, allowing individual player parameters to be informed by a global prior. This approach provides a flexible framework to understand variations both within and between players.

3.1 Global Mixture Model Components

The generative model begins with a global mixture model designed to encapsulate the range of potential shot locations. A dense set of fixed mixture components is created, representing possible intended shot end locations within the goal frame. These components serve as a coarse parameter space over which inference is conducted.

The component means lie on an equidistant grid within the goal frame, and two hyperparameters control the grid's resolution in the horizontal and vertical directions, respectively. Each mean is paired with a nominal covariance matrix that captures execution error around the intended location. Applying several scaling factors to this nominal covariance yields multiple covariance matrices per mean, and therefore a diverse set of components.

Hyperparameters for the model are determined through experimentation, balancing computational efficiency with accuracy. After evaluating various configurations, the final model includes 132 components distributed across 66 distinct mean locations, with each mean location associated with two covariance matrices. A symmetric Dirichlet prior is employed to estimate the global weights for these components, ensuring no inherent bias toward any specific component. Components with negligible weights are pruned to reduce complexity, leaving a parsimonious subset of active components.
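The construction of the component set can be sketched as below. The article gives only the totals (66 means, two covariances each, 132 components). The 11 x 6 grid split, the scaling factors, the nominal covariance, the goal-frame coordinates, and the pruning threshold are all assumptions chosen to reproduce those totals.

```python
import numpy as np

def build_components(n_y, n_z, scales, base_cov,
                     y_range=(36.0, 44.0), z_range=(0.0, 2.67)):
    """Fixed mixture components: a grid of means, each paired with scaled covariances."""
    ys = np.linspace(y_range[0], y_range[1], n_y)
    zs = np.linspace(z_range[0], z_range[1], n_z)
    means, covs = [], []
    for y in ys:
        for z in zs:
            for s in scales:
                means.append((y, z))
                covs.append(s * base_cov)
    return np.array(means), np.array(covs)

def prune(weights, threshold=1e-3):
    """Drop components with negligible weight and renormalise the rest."""
    keep = weights > threshold
    return keep, weights[keep] / weights[keep].sum()

# Assumed split: 11 horizontal x 6 vertical grid points, 2 scaling factors.
means, covs = build_components(11, 6, scales=(1.0, 4.0), base_cov=np.eye(2) * 0.25)

keep, active = prune(np.array([0.5, 0.4995, 0.0005]))
```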

“**Figure 3:** Means and covariances of components used in trimmed shot generative model. The size of the circles corresponds to the value of the covariance scaling factor λ associated with the component, while the opacity reflects the weight associated with the component.”

3.2 Hierarchical Mixture Model

Once the global mixture model is established, the hierarchical framework is employed to estimate player-specific weights over the reduced set of components. This approach accounts for individual differences while leveraging the global prior to guide inference. The likelihood function incorporates the truncated bivariate Gaussian distributions of the active components.
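One common way to write a hierarchy of this kind is shown below. The notation is assumed for illustration and is not taken from the paper. In particular, the concentration parameters α and κ, and the choice of a Dirichlet prior centred on the global weights at the player level, are assumptions. K is the number of active components, p indexes players, and i indexes shots.

```latex
\begin{aligned}
\boldsymbol{\pi}^{0} &\sim \mathrm{Dirichlet}(\alpha \mathbf{1}_K)
  && \text{global weights, symmetric prior} \\
\boldsymbol{\pi}^{(p)} \mid \boldsymbol{\pi}^{0} &\sim \mathrm{Dirichlet}(\kappa\, \boldsymbol{\pi}^{0})
  && \text{player weights, shrunk toward the global weights} \\
k_i \mid \boldsymbol{\pi}^{(p)} &\sim \mathrm{Categorical}(\boldsymbol{\pi}^{(p)})
  && \text{component of shot } i \\
(y_i, z_i) \mid k_i &\sim \mathcal{N}_{z \ge 0}(\boldsymbol{\mu}_{k_i}, \boldsymbol{\Sigma}_{k_i})
  && \text{truncated bivariate Gaussian}
\end{aligned}
```

Under this reading, a player with few shots has weights close to the global weights, while a player with many shots can deviate from them.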

The model is fit using variational inference to approximate the posterior distributions of player-specific parameters. This choice balances computational feasibility with the complexity of the hierarchical model. The resulting framework provides a detailed representation of each player’s shot distribution while maintaining computational efficiency.

4 Proposed Shooting Metrics

The generative model serves as the foundation for two novel shooting metrics: Rao-Blackwellized PostXg (RBPostXg) and Generalized PostXg (GenPostXg). These metrics aim to enhance the stability and predictive power of shooting skill evaluations by incorporating data from all shot trajectories, including off-target attempts.

4.1 RBPostXg: Rao-Blackwellized PostXg

RBPostXg estimates a player’s probability of scoring based on their shot trajectory data, conditioned on the parameters of their generative model. By incorporating both the distribution of a player’s shots and the likelihood of scoring given the end coordinates, this metric provides a robust assessment of shooting skill.

The metric is calculated by integrating over all possible shot end coordinates within the player’s generative model. Each component of the mixture contributes to the overall probability based on its weight and the scoring likelihood for its associated end coordinates. This integration is performed using Monte Carlo methods to estimate the expected scoring probability for an arbitrary shot from the player.
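The Monte Carlo step can be sketched as follows: draw end coordinates from the player's truncated mixture and average the scoring probability at those points. The `toy_post_xg` function, the goal-frame coordinates, and the player parameters are stand-ins invented for illustration. A fitted coordinates-based PostXg model would take the place of the toy function.

```python
import numpy as np

def toy_post_xg(y, z):
    """Stand-in scoring model: flat 0.3 inside an assumed goal frame, 0 outside."""
    return 0.3 if (36.0 < y < 44.0 and 0.0 <= z < 2.67) else 0.0

def rb_post_xg(weights, means, covs, post_xg, n_draws, rng):
    """Monte Carlo estimate of E[PostXg(Y, Z)] under a player's shot distribution."""
    ks = rng.choice(len(weights), size=n_draws, p=weights)
    total = 0.0
    for k in ks:
        while True:  # rejection step enforces the z >= 0 truncation
            y, z = rng.multivariate_normal(means[k], covs[k])
            if z >= 0.0:
                break
        total += post_xg(y, z)
    return total / n_draws

rng = np.random.default_rng(1)
tight_cov = np.array([np.eye(2) * 0.01])
# Hypothetical player A aims at the middle of the goal, player B far wide.
rb_a = rb_post_xg(np.array([1.0]), np.array([[40.0, 1.3]]), tight_cov, toy_post_xg, 2000, rng)
rb_b = rb_post_xg(np.array([1.0]), np.array([[50.0, 1.3]]), tight_cov, toy_post_xg, 2000, rng)
```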

4.2 GenPostXg: Shot-Specific Generalization of RBPostXg

GenPostXg expands on RBPostXg by assigning a value to each individual shot, including off-target attempts. This metric evaluates the theoretical probability of scoring for each shot, considering the range of potential outcomes that could result from the player’s shooting distribution.

For a given shot with end coordinates, the probability that it belongs to a specific mixture component is calculated. These probabilities are then combined with the component-specific scoring likelihoods to compute the shot’s GenPostXg value. By averaging these values across a set of shots, a player’s overall GenPostXg score is derived.
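A minimal sketch of this calculation is given below. It assumes each component's scoring likelihood has already been computed (for example by Monte Carlo) and is passed in as `component_values`. The function names and all numbers are hypothetical.

```python
import math
import numpy as np

def truncated_density(x, mean, cov):
    """Density at x of a bivariate Gaussian truncated to z >= 0."""
    if x[1] < 0.0:
        return 0.0
    cov = np.asarray(cov, dtype=float)
    d = np.asarray(x, dtype=float) - np.asarray(mean, dtype=float)
    quad = d @ np.linalg.solve(cov, d)
    pdf = math.exp(-0.5 * quad) / (2.0 * math.pi * math.sqrt(np.linalg.det(cov)))
    # Probability mass the untruncated Gaussian places on z >= 0.
    mass_above = 0.5 * (1.0 + math.erf(mean[1] / math.sqrt(2.0 * cov[1, 1])))
    return pdf / mass_above

def responsibilities(x, weights, means, covs):
    """Posterior probability that shot x came from each mixture component."""
    dens = np.array([w * truncated_density(x, m, c)
                     for w, m, c in zip(weights, means, covs)])
    return dens / dens.sum()

def gen_post_xg(x, weights, means, covs, component_values):
    """Responsibility-weighted average of the component scoring likelihoods."""
    return float(responsibilities(x, weights, means, covs) @ np.asarray(component_values))

# Hypothetical player: one top-corner component and one wild component.
weights = [0.6, 0.4]
means = [(43.5, 2.3), (50.0, 5.0)]
covs = [np.eye(2) * 0.25, np.eye(2) * 0.25]
component_values = [0.4, 0.0]

near_miss = (44.2, 2.5)  # just outside the assumed post at y = 44
value = gen_post_xg(near_miss, weights, means, covs, component_values)
resp = responsibilities(near_miss, weights, means, covs)
```

In this toy setup the near-miss is attributed almost entirely to the top-corner component, so it earns a value close to 0.4 even though it missed the frame.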

4.3 Coordinates-Based PostXg Model

Both RBPostXg and GenPostXg rely on a PostXg model that predicts scoring likelihood based solely on shot end coordinates. This model is developed using logistic regression with a polynomial expansion of the coordinates. The resulting scoring contours align with intuitive expectations, assigning higher probabilities to shots near the corners of the goal frame and lower probabilities to less favorable locations.
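A model of this shape can be sketched with NumPy alone. The polynomial degree (2), the feature scaling, the goal-frame coordinates, the plain gradient-descent fit, and the synthetic training data are all assumptions. The article does not specify these details, and the synthetic data are not real shots.

```python
import numpy as np

Y_MIN, Y_MAX, Z_MAX = 36.0, 44.0, 2.67  # assumed goal frame, in yards

def poly_features(y, z):
    """Degree-2 polynomial expansion of rescaled (y, z) coordinates."""
    u = (np.atleast_1d(y).astype(float) - (Y_MIN + Y_MAX) / 2) / ((Y_MAX - Y_MIN) / 2)
    v = np.atleast_1d(z).astype(float) / Z_MAX
    return np.column_stack([np.ones_like(u), u, v, u**2, u * v, v**2])

def fit_logistic(X, t, lr=0.5, n_iter=5000):
    """Fit logistic regression coefficients by plain gradient descent."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        beta -= lr * X.T @ (p - t) / len(t)
    return beta

def post_xg(y, z, beta):
    """Scoring probability from end coordinates; zero outside the goal frame."""
    y = np.atleast_1d(y).astype(float)
    z = np.atleast_1d(z).astype(float)
    p = 1.0 / (1.0 + np.exp(-poly_features(y, z) @ beta))
    inside = (y > Y_MIN) & (y < Y_MAX) & (z >= 0.0) & (z < Z_MAX)
    return np.where(inside, p, 0.0)

# Synthetic on-target shots where goals are likelier toward the posts.
rng = np.random.default_rng(0)
y_obs = rng.uniform(Y_MIN, Y_MAX, 4000)
z_obs = rng.uniform(0.0, Z_MAX, 4000)
true_logit = -1.0 + 2.0 * ((y_obs - 40.0) / 4.0) ** 2
goal = (rng.random(4000) < 1.0 / (1.0 + np.exp(-true_logit))).astype(float)

beta = fit_logistic(poly_features(y_obs, z_obs), goal)
corner = post_xg(43.8, 0.2, beta)[0]
centre = post_xg(40.0, 1.3, beta)[0]
wide = post_xg(45.0, 1.0, beta)[0]
```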

While the contour model demonstrates reasonable accuracy, its performance is slightly lower than proprietary models that incorporate additional contextual features such as shot speed and goalkeeper positioning. Nonetheless, the simplified model offers transparency and compatibility with the generative framework, enabling the proposed metrics to effectively evaluate shooting skill.

“**Figure 4:** Contour of PostXg(y, z) model. Regions where a shot is more likely to score are shown in brighter colours, whereas regions where a shot is less likely to score are shown in darker colours. Note that all shot end coordinates outside the goal frame have a PostXg value of zero.”

5 Results

In this section, the outcomes and implications of the proposed metrics are presented, demonstrating their effectiveness in analyzing player shooting patterns. Although the dataset used for this study is not publicly accessible, the codebase implementing these methods is open-source and available online.

5.1 Stability and Predictive Value

To assess the practical utility of the proposed metrics, their predictive value for future player performance was evaluated by correlating players' performance in the first half of a season with their performance in the second half. Stability, in this context, refers to how strongly a metric computed on the first half correlates with the same metric computed on the second half. Predictive value refers to how strongly it correlates with other metrics in the second half.
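The split-half evaluation reduces to a Pearson correlation across player-seasons, optionally after filtering on shot count. The sketch below uses made-up numbers, and the function name and arguments are hypothetical.

```python
import numpy as np

def split_half_correlation(first_half, second_half, n_shots=None, min_shots=0):
    """Pearson correlation between first-half and second-half metric values.

    Each array has one entry per player-season. If n_shots is given, only
    player-seasons with at least min_shots shots are kept.
    """
    a = np.asarray(first_half, dtype=float)
    b = np.asarray(second_half, dtype=float)
    if n_shots is not None:
        mask = np.asarray(n_shots) >= min_shots
        a, b = a[mask], b[mask]
    return float(np.corrcoef(a, b)[0, 1])

# Hypothetical metric values for five player-seasons.
first = [0.10, 0.20, 0.30, 0.40, 0.90]
second = [0.20, 0.40, 0.60, 0.80, 0.10]
shots_taken = [45, 60, 52, 41, 12]

r_all = split_half_correlation(first, second)
r_filtered = split_half_correlation(first, second, n_shots=shots_taken, min_shots=40)
```

Passing the same metric for both halves measures stability, while passing different metrics measures how well one predicts the other.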

The benchmark metrics, GAX and EGA, exhibited very low stability, with correlation values not exceeding 0.056. By contrast, the proposed metrics were more stable, with RBPostXg achieving a correlation of 0.136 and GenPostXg reaching 0.162. The new metrics also outperformed the benchmark metrics in predicting future performance, which suggests that they capture a genuine signal of shooting skill.

“**Table 2:** Inter-season correlation of metrics from first to second half-season. For example, the correlation between players’ GAX in the first half of a season to their EGA in the second half of a season is 0.026. The italicized metric names are our proposed new metrics. The best predictor of each metric in the second half of the season is bolded.”

To address potential biases introduced by low-sample-size players, further analysis was conducted by including only player-seasons with at least 40 shots. Under this condition, the stability of benchmark metrics decreased even further, while the proposed metrics maintained significantly higher stability, reaching a peak correlation of 0.232 for GenPostXg. This finding indicates that the new metrics effectively extract consistent signals from shot end coordinates. Furthermore, the predictive power of the proposed metrics for future performance on benchmark metrics further validates their utility in measuring shooting skill.

“**Table 3:** Inter-season correlation of metrics from first to second half-season for player-seasons with 40 or more shots. This table presents the same information as Table 2 after filtering to player-seasons with at least 40 shots. The italicized metric names are our proposed new metrics. The best predictor of each metric in the second half of the season is bolded.”

Figure 5 illustrates the inter-season correlation of each metric as a function of the sample size threshold for shots in a player-season. The proposed metrics consistently outperform GAX and EGA across all thresholds, achieving inter-season correlations of nearly 0.3 for thresholds of 40 shots or more. In contrast, the benchmark metrics exhibit negligible stability regardless of sample size. This highlights the ability of the proposed metrics to reliably identify skilled shooters even with smaller datasets, a significant advantage for soccer analysts.

“**Figure 5:** Inter-season correlation of metrics as function of player sample size threshold. The sleeves show 90% confidence intervals estimated via bootstrapping.”
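Confidence bands like those described in the Figure 5 caption can be obtained with a percentile bootstrap over player-seasons. The sketch below is one plausible way to do it, not necessarily the authors' procedure. The data are synthetic and the function name is hypothetical.

```python
import numpy as np

def bootstrap_corr_ci(a, b, n_boot=2000, level=0.90, rng=None):
    """Percentile bootstrap confidence interval for a Pearson correlation."""
    if rng is None:
        rng = np.random.default_rng()
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    n = len(a)
    stats = np.empty(n_boot)
    for i in range(n_boot):
        idx = rng.integers(0, n, size=n)  # resample player-seasons with replacement
        stats[i] = np.corrcoef(a[idx], b[idx])[0, 1]
    lo, hi = np.quantile(stats, [(1.0 - level) / 2.0, (1.0 + level) / 2.0])
    return float(lo), float(hi)

# Synthetic first-half and second-half values sharing a weak common signal.
rng = np.random.default_rng(0)
skill = rng.normal(size=300)
first = skill + rng.normal(scale=1.5, size=300)
second = skill + rng.normal(scale=1.5, size=300)
lo, hi = bootstrap_corr_ci(first, second, rng=rng)
```

Repeating this for each shot-count threshold would give one interval per point on the curve.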

These results are particularly relevant for talent evaluation. A team that identifies skilled players earlier than its competitors gains a strategic advantage in recruitment and transfer markets, leading to both on-field success and financial benefits. A sensitivity analysis, detailed in the appendix, confirms the robustness of the results to preprocessing decisions, such as shot distance filtering and coordinate reflection.

5.2 Player Analysis

The mixture model component weights for individual players offer valuable insights into their shooting habits. Figure 6 displays the component weights for selected player-seasons alongside the global weights estimated across the entire dataset. While player-specific weights generally align with global patterns, notable deviations highlight unique shooting tendencies.

“**Figure 6:** Component weights for selected player-seasons. The x-axis indexes the components of the mixture model, and the order is arbitrary. The global mixture model weights are shown in the leftmost bar of each group, in purple.”

For instance, in 2018, Sebastian Giovinco exhibited an unusually high weight for a component near the top-left corner of the goal. This information could enable opposing coaches to better prepare defenders and goalkeepers to anticipate Giovinco’s shots. Similarly, analysis of Zlatan Ibrahimovic’s 2019 performance revealed a lower-than-average weight for an accurate bottom-left corner component and a higher weight for a less precise counterpart. Such insights could guide targeted training to improve his shooting accuracy when aiming for that part of the goal.

6 Conclusion

This study introduces a hierarchical generative model for soccer shot coordinates, leveraging truncated bivariate Gaussian distributions to capture shot patterns. The player-specific parameters derived from this model enable teams to diagnose individual shooting habits, prepare for opponents, and identify areas for improvement.

Additionally, two novel shooting metrics were proposed, which incorporate off-target shot trajectories to offer a more comprehensive evaluation of player skill. These metrics outperform traditional metrics in terms of stability and predictive value, making them particularly useful for recruitment and player valuation. Their ability to provide reliable estimates of shooting skill even with limited sample sizes represents a significant advancement in soccer analytics, equipping teams with better tools for decision-making in both tactical and financial contexts.


Baron, E., Sandholtz, N., Pleuler, D., & Chan, T. C. (2024). Miss it like Messi: Extracting value from off-target shots in soccer. Journal of Quantitative Analysis in Sports, 20(1), 37-50. https://arxiv.org/abs/2308.01523