Finding the Perfect Replacement
Leveraging a Spatial Similarity Index to Compare Players Across the Field.
The following summary critically reviews the research paper titled "Spatial similarity index for scouting in football" by Virgilio Gómez-Rubio, Jesús Lagos and Francisco Palmí-Perales. All data, figures, and analysis presented here are drawn from their original work; I do not claim any authorship or ownership of the content. This summary has been written to provide a concise and technically informed synthesis of the paper’s findings, methodologies, and implications, while maintaining fidelity to the authors’ intellectual contributions.
1. Introduction
While player statistics are widely available, spatial data, routinely visualized as heatmaps, remains underutilized in profiling. This data, which reflects a player's positioning on the field, provides implicit cues about their tactical function but is rarely leveraged systematically for similarity analysis.
Football data falls broadly into two categories: event data and tracking data [19, 20]. Event data captures discrete actions such as fouls or passes, annotated with player identities and locations, and is predominantly recorded manually. Tracking data, on the other hand, is collected via wearable devices or computer vision systems and can yield continuous positional variables, such as distance covered or movement speed.
The majority of recent studies favor quantitative approaches, particularly those employing machine learning. For example, [21] uses rule-based systems and predictive models to extract and structure inter-player relationships from performance statistics. Similarly, [31] applies multiple machine learning techniques, implemented via the Rminer package, to derive insights into player performance.
Beyond prediction, machine learning has been used to quantify a player’s "fit" within a team. In [5], a random forest approach produces a composite “fit” variable, aligning with the broader trend of reducing multi-dimensional performance into composite indicators [23, 24].
Some studies rely on multiple regression models to relate performance variables [27]. Yet few incorporate positional data directly into the model structure. A notable exception is [11], which models pass effectiveness through Ridge regression using location, speed, and angle data, though model fit remained limited.
A body of work has emerged around spatial modeling in football. This includes applications of motion models for player and ball positions [7, 18], as well as simpler linear models like ANOVA to capture passing trends [30]. However, most spatial methods stop short of formal spatio-temporal modeling. Instead, they treat coordinates as covariates or compute summary metrics [6, 15].
Advanced spatio-temporal modeling remains rare. One notable exception is [32], which introduces a Transformer-Based Neural Marked Spatio-Temporal Point Process (NMSTPP) for modeling the timing and type of in-game events. Another is [17], which offers a Bayesian marked point process that estimates event probabilities and simulates match outcomes, enabling inference on expected goal values.
Despite these contributions, "spatial methodologies in football analysis remain scarce," with few studies proposing structured frameworks to analyze spatial player behavior, particularly for scouting purposes.
To address this, the paper proposes a spatial similarity index derived from spatial cross-correlation techniques. Specifically, a football field is partitioned into a lattice of uniform cells, each representing a spatial unit where a player’s activity is quantified (e.g., time spent). By applying Lee’s statistic [14, 16], a spatial similarity score is computed that reflects how closely two players occupy comparable zones on the pitch. This method, although illustrated using football data [2], is designed to generalize to other invasion-based team sports.
2. Spatial data analysis
Spatial data analysis concerns datasets in which each observation is tied to a location [3, 8]. In this context, the football pitch constitutes the spatial domain, which is discretized into regular units, typically rectangles or squares. For each unit, variables such as time spent, player density, or event counts may be recorded.
Figure 1 illustrates such a discretization, where “each area is represented by a polygon,” and adjacency between areas is also defined. For the present application, emphasis is placed on the average time spent per area by each player, serving as a proxy for spatial occupation. Figure 2 further visualizes heatmaps for five players, each representing a spatial footprint that informs player comparisons from a scouting perspective.
Spatial data typically exhibits spatial autocorrelation, where values in neighboring regions are not independent. That is, “areas that share a boundary… will have similar values.” This inherent spatial structure necessitates specific statistical tools.
Neighborhood relationships are encoded via a binary adjacency matrix. This structure is further generalized by spatial weights matrices W, which may assign continuous weights to encode varying strength of spatial relationships [3].
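As a rough illustration of how such a neighborhood structure might be built for a rectangular lattice, the Python sketch below constructs a binary adjacency matrix and a row-standardized weights matrix W. The function names `grid_adjacency` and `row_standardize`, the row-major cell ordering, and the rook/queen switch are assumptions made for this example; the paper's own implementation may define neighbors differently.

```python
import numpy as np

def grid_adjacency(n_rows, n_cols, queen=False):
    """Binary adjacency matrix for the cells of a regular grid (row-major order).

    queen=False links cells that share an edge (rook contiguity);
    queen=True also links cells that only touch at a corner.
    """
    n = n_rows * n_cols
    A = np.zeros((n, n), dtype=int)
    for r in range(n_rows):
        for c in range(n_cols):
            i = r * n_cols + c
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    if dr == 0 and dc == 0:
                        continue  # a cell is not its own neighbor
                    if not queen and dr != 0 and dc != 0:
                        continue  # rook contiguity: skip diagonal cells
                    rr, cc = r + dr, c + dc
                    if 0 <= rr < n_rows and 0 <= cc < n_cols:
                        A[i, rr * n_cols + cc] = 1
    return A

def row_standardize(A):
    """Turn a binary adjacency matrix into weights whose rows sum to one."""
    A = np.asarray(A, dtype=float)
    row_sums = A.sum(axis=1, keepdims=True)
    return np.divide(A, row_sums, out=np.zeros_like(A), where=row_sums > 0)

A = grid_adjacency(3, 3)   # 3 x 3 toy pitch
W = row_standardize(A)     # each row of W sums to 1
```

Row standardization is only one common choice of weights; it makes the spatial lag of a variable the plain average over each cell's neighbors.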

2.1. Spatial autocorrelation
A widely used measure for global spatial association is Moran’s I, which quantifies the similarity of a variable across neighboring areas:
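Using standard notation (assumed here, and possibly different from the paper's), let x_i be the value observed in area i, x̄ the mean over the n areas, and w_ij the entries of the weights matrix W. Moran's I is then usually written as:

```latex
\[
I = \frac{n}{\sum_{i=1}^{n}\sum_{j=1}^{n} w_{ij}}
    \cdot
    \frac{\sum_{i=1}^{n}\sum_{j=1}^{n} w_{ij}\,(x_i - \bar{x})(x_j - \bar{x})}
         {\sum_{i=1}^{n} (x_i - \bar{x})^2}
\]
```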
Positive values of Moran’s I indicate clustering of similar values (high-high or low-low), while negative values suggest dissimilar values between neighbors (high-low). A value near zero implies spatial randomness.
In football, spatial autocorrelation is expected due to the smooth and role-dependent movements of players across zones.
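To make the sign interpretation concrete, here is a minimal Python sketch of Moran's I in its standard textbook form. The function name `morans_i` and the four-cell chain of areas are assumptions chosen purely for illustration; they do not come from the paper.

```python
import numpy as np

def morans_i(x, W):
    """Global Moran's I of x under spatial weights W (standard definition)."""
    x = np.asarray(x, dtype=float)
    z = x - x.mean()                      # deviations from the mean
    return (x.size / W.sum()) * (z @ W @ z) / (z @ z)

# Four areas in a row: each area neighbors the one before and after it.
n = 4
W = np.diag(np.ones(n - 1), 1) + np.diag(np.ones(n - 1), -1)

smooth = morans_i([1, 2, 3, 4], W)        # neighbors hold similar values
alternating = morans_i([1, 0, 1, 0], W)   # neighbors hold opposite values
```

The smoothly increasing pattern gives a positive value and the alternating pattern a negative one, matching the interpretation above.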
2.2. Spatial cross-correlation
To compare spatial distributions between two variables, the study employs Lee’s statistic, a bivariate generalization of Moran’s I:
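In the form commonly attributed to Lee's original formulation (the notation is assumed here: x_i and y_i are the two variables in area i, with means x̄ and ȳ, and w_ij are the entries of W), the statistic can be written as:

```latex
\[
L = \frac{n}{\sum_{i=1}^{n}\Bigl(\sum_{j=1}^{n} w_{ij}\Bigr)^{2}}
    \cdot
    \frac{\sum_{i=1}^{n}\Bigl[\sum_{j=1}^{n} w_{ij}\,(x_j - \bar{x})\Bigr]
                        \Bigl[\sum_{j=1}^{n} w_{ij}\,(y_j - \bar{y})\Bigr]}
         {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,
          \sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}
\]
```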
This statistic captures the degree to which two spatial variables (e.g., two players' heatmaps) co-vary across adjacent zones. Positive values indicate that high (or low) values in one variable tend to align with high (or low) values in the other, implying similar field occupation. Conversely, negative values suggest opposing spatial profiles.
Importantly, the method includes a statistical test of significance based on the null hypothesis of spatial independence. Since scouting focuses on matching player roles, only positive spatial correlation is relevant, i.e., when both players tend to occupy similar zones.
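The sketch below shows one way Lee's statistic and a one-sided test could be computed in Python. It is an illustration under stated assumptions rather than the authors' implementation: the function names `lee_statistic` and `lee_pvalue` are hypothetical, the test is assumed to be a Monte Carlo permutation test that shuffles one variable's cell values relative to the other's, and the 20-cell chain is a toy layout rather than a real pitch.

```python
import numpy as np

def lee_statistic(x, y, W):
    """Lee's L for two variables observed on the same set of areas."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    zx, zy = x - x.mean(), y - y.mean()
    lag_x, lag_y = W @ zx, W @ zy               # spatially smoothed deviations
    scale = x.size / np.sum(W.sum(axis=1) ** 2)
    return scale * (lag_x @ lag_y) / np.sqrt((zx @ zx) * (zy @ zy))

def lee_pvalue(x, y, W, n_perm=999, seed=0):
    """One-sided Monte Carlo p-value for positive spatial cross-correlation.

    Assumption: the null distribution is simulated by permuting y across areas.
    """
    rng = np.random.default_rng(seed)
    y = np.asarray(y, dtype=float)
    observed = lee_statistic(x, y, W)
    sims = np.array([lee_statistic(x, rng.permutation(y), W)
                     for _ in range(n_perm)])
    # Share of simulated values at least as large as the observed one.
    return (1 + np.sum(sims >= observed)) / (n_perm + 1)

# Toy layout: 20 areas in a row with row-standardized weights.
n = 20
A = np.diag(np.ones(n - 1), 1) + np.diag(np.ones(n - 1), -1)
W = A / A.sum(axis=1, keepdims=True)

profile = np.arange(n, dtype=float)                   # activity rising along the row
p_same = lee_pvalue(profile, profile, W)              # identical profiles
p_opposite = lee_pvalue(profile, profile[::-1], W)    # mirrored profiles
```

Identical profiles yield a small p-value and mirrored profiles a p-value close to one, which is the behavior the one-sided test is meant to capture.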
3. Spatial similarity
To quantify spatial similarity between players, the study employs Lee’s statistic [14] on their positional distributions across a discretized pitch. The primary input is the proportion of time each player spends in each area, chosen to mitigate scale differences arising from unequal playing time or activity levels.
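As a minimal sketch of this input, the Python snippet below turns raw positions into per-cell proportions. It assumes positions are sampled at a constant rate on a [0,100]×[0,100] pitch, so that counts are proportional to time spent; the function name `occupancy_proportions` and the grid size are hypothetical choices for this example.

```python
import numpy as np

def occupancy_proportions(x, y, n_rows=10, n_cols=10, extent=100.0):
    """Proportion of observations in each cell of an n_rows x n_cols lattice.

    Rows index the y coordinate and columns the x coordinate.
    """
    x_edges = np.linspace(0.0, extent, n_cols + 1)
    y_edges = np.linspace(0.0, extent, n_rows + 1)
    counts, _, _ = np.histogram2d(y, x, bins=[y_edges, x_edges])
    total = counts.sum()
    if total == 0:
        raise ValueError("no positions fall inside the pitch")
    return counts / total   # proportions sum to 1, whatever the playing time

# Two observations near one corner, one near the opposite corner.
xs = np.array([5.0, 5.0, 95.0])
ys = np.array([5.0, 5.0, 95.0])
grid = occupancy_proportions(xs, ys, n_rows=2, n_cols=2)
```

Dividing by the total is what removes the dependence on minutes played: a substitute and a regular starter with the same positional habits end up with the same profile.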
Although the raw Lee’s statistic or its standardized form (i.e., the Z-score) could be used directly, the authors opt for the associated p-value from a one-sided statistical test under the null of spatial independence. This choice ensures a bounded, scale-independent similarity measure. Specifically, small p-values indicate strong positive spatial correlation (i.e., similar positioning patterns) whereas large p-values reflect no or negative association.
This p-value serves as a pseudo-distance between players: values near 0 imply spatial similarity, while values near 1 imply dissimilarity. Although not a true metric (e.g., it lacks symmetry and the triangle inequality), this pseudo-distance is sufficient for hierarchical clustering.
Clustering is then applied using standard agglomerative algorithms (e.g., complete linkage), where players are iteratively grouped based on their pairwise pseudo-distances. This facilitates grouping players by spatial roles and enhances interpretability in scouting applications.
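A possible implementation of this step with SciPy is sketched below. The p-values are synthetic numbers for four hypothetical players, and averaging the matrix with its transpose to obtain a symmetric input is an assumption of this example, not a step reported in the paper.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

def cluster_players(pvals, n_clusters):
    """Complete-linkage clustering on a matrix of p-value pseudo-distances."""
    P = np.asarray(pvals, dtype=float)
    D = (P + P.T) / 2.0            # assumption: symmetrize by averaging
    np.fill_diagonal(D, 0.0)       # a player is at distance 0 from itself
    Z = linkage(squareform(D, checks=True), method="complete")
    return fcluster(Z, t=n_clusters, criterion="maxclust")

# Synthetic pseudo-distances: players 0 and 1 are alike, as are players 2 and 3.
pvals = np.array([
    [0.00, 0.01, 0.90, 0.95],
    [0.02, 0.00, 0.92, 0.97],
    [0.91, 0.93, 0.00, 0.02],
    [0.94, 0.96, 0.03, 0.00],
])
labels = cluster_players(pvals, n_clusters=2)
```

Complete linkage merges two groups only when their most dissimilar members are still close, which tends to produce compact clusters.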
4. Example: spatial distribution of soccer players
This section demonstrates the spatial similarity methodology on two datasets derived from the 2019–2020 season of Spain’s La Liga. The data, obtained from Wyscout, consist of player heatmaps built by aggregating activity over a discretized pitch whose coordinates are rescaled to [0,100]×[0,100]. Each cell contains a value representing the average time spent by a player in that region.
To ensure comparability, a weighted density smoothing approach standardizes these spatial profiles onto a shared raster grid.
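The summary above does not spell out the smoothing procedure, so the following Python sketch should be read as one plausible version only: a weighted Gaussian kernel evaluated at the centers of a common grid. The function name `weighted_density_raster`, the bandwidth, and the grid resolution are all assumptions.

```python
import numpy as np

def weighted_density_raster(points, weights, n_cells=20, bandwidth=5.0, extent=100.0):
    """Smooth weighted points onto a shared n_cells x n_cells raster.

    points  : array of shape (m, 2) holding (x, y) locations
    weights : array of shape (m,), e.g. time spent at each location
    Rows of the result index y and columns index x.
    """
    points = np.asarray(points, dtype=float)
    weights = np.asarray(weights, dtype=float)
    centers = (np.arange(n_cells) + 0.5) * extent / n_cells
    gx, gy = np.meshgrid(centers, centers)        # cell-center coordinates
    dx = gx[..., None] - points[:, 0]             # shape (n_cells, n_cells, m)
    dy = gy[..., None] - points[:, 1]
    kernel = np.exp(-(dx ** 2 + dy ** 2) / (2.0 * bandwidth ** 2))
    density = (kernel * weights).sum(axis=-1)
    return density / density.sum()                # normalize to proportions

# One weighted observation located at a cell center of a 4 x 4 raster.
raster = weighted_density_raster([[12.5, 87.5]], [3.0], n_cells=4)
peak = np.unravel_index(raster.argmax(), raster.shape)
```

Because every player is evaluated on the same cell centers, the resulting rasters can be compared cell by cell.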
4.1. Toy example
A small-scale example compares five players with distinct roles: Unai Simón (goalkeeper), Banega, Kroos, Parejo (midfielders), and Messi (forward). Figure 2 visualizes their spatial profiles, and Table 1 presents pairwise Lee’s statistics and corresponding one-sided p-values.
The diagonal of Table 1 shows high self-comparison statistics, while Unai Simón's profile diverges sharply from all others, as reflected in negative Lee’s statistics and p-values near one. In contrast, the midfielders (Banega, Kroos, Parejo) exhibit high similarity among themselves, with Messi lying between them and Simón.

A hierarchical clustering using the p-value pseudo-distances (Figure 3) effectively groups the midfielders, placing Messi and Simón in progressively more distant clusters. This illustrates the method’s sensitivity to role-specific spatial patterns.

4.2. Analysis of “La Liga”, season 2019–2020
The full analysis extends the methodology to 500 players from 20 La Liga teams. A pairwise comparison across all players yields a pseudo-distance matrix, which is then clustered hierarchically. As shown in Figure 4, most similarity values are concentrated near 0 or 1, indicating that the p-value-based index draws a sharp line between spatially similar and dissimilar players.
Clustering results (dendrogram, right of Figure 4) suggest approximately 10 distinct groups. These groups, averaging 50 players each, correspond to spatially defined positional roles.

Figure 5 confirms that reordering the similarity matrix by cluster reveals clear block structures, validating the method's discriminative power.
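The reordering itself is simple to reproduce. The Python sketch below sorts rows and columns by cluster label so that players in the same cluster sit next to each other; the function name `reorder_by_cluster`, the labels, and the 0/1 matrix are synthetic and serve only to show the block pattern.

```python
import numpy as np

def reorder_by_cluster(matrix, labels):
    """Permute rows and columns so that members of a cluster are contiguous."""
    M = np.asarray(matrix)
    order = np.argsort(np.asarray(labels), kind="stable")
    return M[np.ix_(order, order)], order

# Synthetic example: 0 within a cluster, 1 between clusters.
labels = np.array([2, 1, 2, 1])
D = (labels[:, None] != labels[None, :]).astype(int)
blocks, order = reorder_by_cluster(D, labels)
```

After reordering, the within-cluster zeros form square blocks along the diagonal, which is the structure a heatmap of the reordered matrix makes visible.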

Figure 6 shows heatmaps from Cluster 4, which clearly consists of goalkeepers, including J. Oblak, M. ter Stegen, and T. Courtois.
The analysis confirms that the proposed similarity index can systematically group players by spatial behavior, offering a scalable and interpretable tool for scouting and tactical profiling.
5. Discussion and concluding remarks
Player roles in football are intrinsically linked to their spatial behavior on the pitch. Consequently, comparing players through their spatial distribution offers a principled way to identify similar profiles, which is especially relevant in scouting contexts (e.g., replacing injured players or finding tactically equivalent alternatives).
The proposed spatial similarity index, grounded in Lee’s statistic, provides a robust quantitative measure for comparing heatmap-style spatial profiles. Because it is computed from proportions of time spent across discretized pitch regions, it generalizes well and can be combined with filters based on other player attributes.
Importantly, the method is flexible. Different spatial variables (e.g., offensive vs. defensive positioning) can be analyzed separately to construct multi-faceted profiles. Moreover, off-ball movements, often overlooked in event-based analyses, can be captured when tracking data is available, highlighting actions such as space occupation or movement to draw defenders.
Though this study applies the method to football, its structure is agnostic to sport and is applicable to any invasion-based team game with spatial tracking. Furthermore, the spatial similarity index can be integrated with broader data pipelines for more comprehensive scouting strategies.
References
Gómez-Rubio, V., Lagos, J., & Palmí-Perales, F. (2025). Spatial similarity index for scouting in football. Journal of Applied Statistics, 1-14. https://arxiv.org/abs/2412.08303
To keep this article concise, please refer to the original paper for the full list of references.