Can a Common Format Fix Football’s Data Chaos?
Introducing CDF: A Standardized Schema for Match, Event, and Tracking Data.
The following summary critically reviews the research paper titled "Common Data Format (CDF): A Standardized Format for Match-Data in Football (Soccer)" by Gabriel Anzer, Kilian Arnsmeyer, Pascal Bauer, Joris Bekkers, Ulf Brefeld, Jesse Davis, Nicolas Evans, Matthias Kempe, Samuel J Robertson, Joshua Wyatt Smith and Jan Van Haaren. All data, figures, and analysis presented here are drawn from their original work; I do not claim any authorship or ownership of the content. This summary has been written to provide a concise and technically informed synthesis of the paper’s findings, methodologies, and implications, while maintaining fidelity to the authors’ intellectual contributions.
1 Introduction
Analysts in team invasion sports such as football rely on six principal data sources:
Match sheet data includes core statistics such as results, lineups, goals, and disciplinary actions, and is universally available.
Video footage, captured from multiple camera angles, supports both broadcast and tactical evaluation purposes.
Event data logs annotated “on-ball” actions (such as passes and tackles) with contextual attributes like timestamps and field locations, though their definitions often vary between vendors and are subject to human error [22].
Tracking data provides spatiotemporal coordinates for players, the ball, and referees, often using fixed or mobile camera systems, or via computer vision techniques [20].
Match meta data and physical data complement these sources with contextual and biometric information.
Analytical value is maximized when these heterogeneous data types are integrated [24]. However, integration is obstructed by a lack of standardization across data definitions, formats, units, coordinate systems, and entity identifiers. This results in challenges such as inconsistent event labeling ("even in seemingly objective events such as shots, there are disagreements about how many of them occurred in a match" [5]) and difficulty in aligning timestamps or spatial references due to differing vendor conventions and recording conditions [2, 30].
To overcome these fragmentation issues, the authors propose a Common Data Format (CDF): a standardized schema for representing football match data. The CDF aims to be unambiguous, complete, and extensible, supporting consistent data structures across sources and facilitating efficient analysis, integration, and error detection. It promotes compatibility, lowers technical barriers, and supports reproducible analytics. Notably, the first CDF version deliberately omits subjective or non-standardized event types, with the intent that "as concepts evolve over time and become more standardised, they can be added to future editions of the CDF."
2 Data about Football Matches
As outlined in the introduction, football match data can be grouped into six broad categories. Each contributes uniquely to analytical workflows, and the authors argue that all of them must be considered when proposing a unified data format.
2.1 Match Sheet Data
These are standard administrative records collected for all matches, professional or amateur, including team and player identities, lineups, substitutions, referees, goals, cards, and final results. This foundational data ensures consistent reporting and competition management.
2.2 Video Data
Video remains one of the most valuable resources. It includes TV footage, which combines multiple camera angles to enhance the viewer experience, and tactical feeds, which use high-angle fixed cameras to keep all players in view. Modern broadcasts also incorporate diverse perspectives (e.g., behind-goal or 16m-line views), making footage suitable for tactical and performance analysis.
2.3 Event Data
Event data captures semantically significant in-game actions such as passes, shots, and tackles [22], annotated with time, location, and involved players. Traditionally gathered by trained human coders employed by vendors like Stats Perform and StatsBomb, these annotations vary widely across providers. As noted, “each company has their own catalog of events,” especially for ambiguous actions like tackles or interceptions. The 2022 launch of the FIFA Football Language aims to standardize terminology, while automated event recognition is an ongoing area of development [31]. Event data is critical because of its widespread availability and semantic richness.
2.4 Tracking Data
Tracking systems capture continuous x/y (and often z) coordinates of players, the ball, and referees. Early systems used 2D center-of-body locations, while newer setups include 3D skeletal landmarks. Tracking can be collected via fixed stadium cameras, mobile setups, broadcast-derived estimates, or GNSS/LPS systems. Each has trade-offs: for example, broadcast-derived tracking offers broad availability but suffers from occlusion, while LPS provides high accuracy but limited deployment. FIFA’s recent use of combined systems (e.g., LPS for ball and fixed cameras for players) illustrates this hybrid approach.
2.5 Match Meta Data
This includes contextual details such as match type, kickoff time, weather, attendance, and pitch dimensions. Meta data is essential for disambiguating time series and aligning datasets with variable pitch sizes or external factors.
2.6 Player Physical Data
Collected using wearable sensors like GNSS devices and inertial measurement units (IMUs), this data quantifies players’ physical outputs, e.g., distance covered, acceleration counts, and heart rate. Vendors like Catapult and STATSports supply this equipment. Although tracking data can replicate many of these measures, wearable sensors remain essential for training sessions where camera-based systems are unavailable.
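To illustrate how tracking data can reproduce a physical measure such as distance covered, here is a minimal sketch. It assumes positions are (x, y) tuples in meters sampled at consecutive frames; the function name and the sample track are hypothetical and not taken from the paper.

```python
import math

def distance_covered(positions):
    """Sum the straight-line distances between consecutive (x, y) samples, in meters."""
    total = 0.0
    for (x0, y0), (x1, y1) in zip(positions, positions[1:]):
        total += math.hypot(x1 - x0, y1 - y0)
    return total

# Hypothetical track: 3 m along x, then 4 m along y.
track = [(0.0, 0.0), (3.0, 0.0), (3.0, 4.0)]
print(distance_covered(track))  # 7.0
```

Real tracking data is noisy, so a production pipeline would typically smooth positions before summing; that step is omitted here.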
3 Challenges Posed by Football Data
According to the researchers, football data presents four major challenges: inconsistent data specifications, varied representations, non-uniform delivery formats, and low data quality, especially due to manual collection.
3.1 Data Specification
Vendors define and collect event data differently. “Event data collected by one provider for a specific game can markedly differ from the data collected by another provider for the identical game.” For instance, the definition of a "cross" may vary by location, height, or intent. These discrepancies stem from differing tactical focuses, target audiences, and resource constraints [6]. Since the Laws of the Game only define a minimal set of events [7], vendors must independently construct broader taxonomies, leading to mismatched interpretations and inconsistencies in derived metrics like expected goals [2] or pressure metrics [1].
3.2 Data Representation
Data vendors differ in how they encode identical content. Coordinate systems may center at the pitch's midpoint or a corner; pitch coordinates may be raw or normalized; playing direction may be fixed or team-relative; units range from meters to yards; time is expressed in UTC or local game time; and entity IDs are proprietary. As a result, spatial and temporal alignment between datasets becomes non-trivial.
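As a rough illustration of why alignment is non-trivial, the sketch below converts one hypothetical vendor convention (coordinates normalized to 0..1 with the origin at a pitch corner) into center-origin meters. The function names and the default pitch size of 105 x 68 m are assumptions for this example; in practice the pitch dimensions would come from the match meta data, and the sketch assumes both systems have their axes pointing in the same direction.

```python
def normalized_to_centered_meters(x_norm, y_norm, pitch_length=105.0, pitch_width=68.0):
    """Map corner-origin coordinates in [0, 1] to center-origin coordinates in meters."""
    x_m = (x_norm - 0.5) * pitch_length
    y_m = (y_norm - 0.5) * pitch_width
    return x_m, y_m

def yards_to_meters(value_yd):
    """Convert a length in yards to meters (1 yd = 0.9144 m exactly)."""
    return value_yd * 0.9144

# The normalized pitch center maps to the origin of the centered system.
print(normalized_to_centered_meters(0.5, 0.5))  # (0.0, 0.0)
# The far corner maps to half the pitch length and half the pitch width.
print(normalized_to_centered_meters(1.0, 1.0))  # (52.5, 34.0)
```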
3.3 Data Delivery
Formats and delivery pipelines also diverge. Vendors use JSON or XML, with either flat or nested structures. Delivery methods include REST APIs, FTP servers, or message queues like RabbitMQ. While some vendors offer libraries for ease of integration, these differences impose additional burdens on data engineering.
3.4 Data Quality
Manual data collection leads to systematic and random errors. These include incorrect timestamps, player IDs, or coordinates, missing meta information, and undetected events, especially when broadcast video lacks complete pitch coverage. As the authors note, “even within the same provider, a very similar situation may [be] classified differently for no apparent reason.” Annotation subjectivity and limited video access compound the problem.
4 Related Work
Standardized data formats are well-established in other domains like healthcare and transportation. These precedents serve as a blueprint for how open, standardized formats can deliver commercial and public value.
In football, several partial standardization efforts have emerged. SPADL (Soccer Player Action Description Language) [12] proposes a unified tabular format for event data that eases computation of metrics like expected goals (xG) [25] and expected threat (xT) [27, 28]. It is open source, supports cross-provider compatibility, and is widely used in both academic and industrial contexts.
For tracking data, FIFA and FC Barcelona have developed global Electronic Performance and Tracking Systems (EPTS) standards [13]. Additionally, open-source libraries like kloppy [32] and floodlight [23] parse both tracking and event data, and several tools exist to synchronize event and tracking timelines [21, 30].
Despite these efforts, no initiative has unified all match data types under a single schema supported by a broad stakeholder base. Similar to Robertson et al. [26], the present work introduces the Common Data Format (CDF) to address this fragmentation by offering a consolidated and extensible data framework.
5 The Common Football Data Format
The Common Data Format (CDF) introduces a standardized interface for football data to streamline analysis across clubs, federations, and researchers. It emphasizes clarity (unambiguous), analytical utility (sufficient), and adaptability (extendable), aiming to reduce integration costs and improve error detection.
5.1 Specification of the CDF Schema
CDF supports five core data types: match sheet, video, event, tracking, and match meta data. Only match sheet data is mandatory; if event or tracking data is included, meta data is also required. Event and tracking data must be synchronized in time and space [2, 3, 21, 30].
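The dependency rule above can be sketched as a simple check. This is not the paper's validation tool; the function name and the data-type labels ("match_sheet", "event", "tracking", "match_meta") are hypothetical names chosen for illustration.

```python
def check_cdf_package(available):
    """Return a list of problems for the set of data types present in a delivery."""
    available = set(available)
    problems = []
    # Match sheet data is the only mandatory data type.
    if "match_sheet" not in available:
        problems.append("match sheet data is mandatory")
    # Event or tracking data may only be delivered together with match meta data.
    if ({"event", "tracking"} & available) and "match_meta" not in available:
        problems.append("match meta data is required with event or tracking data")
    return problems

print(check_cdf_package({"match_sheet", "event"}))
# ['match meta data is required with event or tracking data']
```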
Each data type includes required fields (see the original paper for the full list):
Match Sheet Data: Identifiers, team/player info, lineups, goals, substitutions, cards, and match outcomes.
Video Footage: Frame rate, resolution, start time, camera perspective, and whistle timestamps for major events.
Event Data: Includes only objective or widely accepted actions. Annotated with type, subtype, success, involved players, spatial coordinates, body part, and related event IDs.
Tracking Data: Captures center-of-body positions per frame for all players and the ball, denoting ball possession and play status. Skeletal tracking uses a separate schema with landmark-level coordinates and visibility flags.
Match Meta Data: Includes competition context, pitch dimensions, kickoff time, team/player info, and vendor/versioning details. Skeletal hierarchies must conform to a glTF 2.0-like structure if landmarks are used.
All fields must follow the defined schema; optional fields must adhere to predefined names and types if included. The format employs semantic versioning for maintainability and is supported by a Python validation tool.
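A field-level check in the spirit of this rule might look like the following sketch. It is not the Python validation tool mentioned above; the field names and types in REQUIRED and OPTIONAL are hypothetical placeholders, and treating unknown field names as errors is my reading of the rule rather than something the paper spells out here.

```python
# Hypothetical field definitions for an event record (not the paper's actual schema).
REQUIRED = {"event_id": str, "type": str}
OPTIONAL = {"body_part": str, "x": float}

def validate_record(record, required=REQUIRED, optional=OPTIONAL):
    """Return a list of schema problems found in a single record."""
    errors = []
    for name, expected in required.items():
        if name not in record:
            errors.append(f"missing required field: {name}")
        elif not isinstance(record[name], expected):
            errors.append(f"wrong type for {name}")
    for name, value in record.items():
        if name in required:
            continue
        if name not in optional:
            errors.append(f"unknown field: {name}")
        # None stands for a missing (null) value and is accepted for optional fields.
        elif value is not None and not isinstance(value, optional[name]):
            errors.append(f"wrong type for {name}")
    return errors

print(validate_record({"event_id": "e1", "type": "pass", "x": "left"}))
# ['wrong type for x']
```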
5.2 Representational Conventions in the CDF
CDF mandates UTC time and metric units. The pitch coordinate system is centered at (x, y, z) = (0, 0, 0), and the following conventions apply.
The home team always plays left-to-right along the x-axis, which simplifies normalization. Events beyond the pitch boundaries are captured using out-of-range coordinates. Absolute timestamps, rather than relative times, are used to accommodate interruptions. Each tracked entity (e.g., players, teams, referees) must have a unique identifier, and missing values must be denoted using null. Floating-point values are limited to three decimal places for conciseness.
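A minimal sketch of how these conventions could be applied to a single position follows. It assumes the input is already in center-origin meters; the function name and the home_plays_left_to_right flag are hypothetical and do not come from the paper.

```python
import json

def to_cdf_conventions(x, y, home_plays_left_to_right):
    """Rotate a centered (x, y) position if needed and round it to three decimals."""
    if x is None or y is None:
        return None, None  # missing values stay null
    if not home_plays_left_to_right:
        # A 180-degree rotation about the pitch center flips the playing direction.
        x, y = -x, -y
    return round(x, 3), round(y, 3)

x, y = to_cdf_conventions(10.12345, -5.0, home_plays_left_to_right=True)
print(json.dumps({"x": x, "y": y, "z": None}))
# {"x": 10.123, "y": -5.0, "z": null}
```

Python's None serializes to JSON null, which matches the convention for missing values.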
5.3 Structure of Delivered Data
CDF supports both post-match and live applications via compact and interoperable formats:
JSON (.json) is used for static data (e.g., match sheets, meta).
JSON Lines (.jsonl) is used for real-time data (e.g., events, tracking).
This dual structure simplifies ingestion, enables extensibility, and supports popular processing libraries (e.g., pandas, PySpark).
Match Sheet and Meta Data: Delivered as structured JSON. Skeletal tracking requires metadata on landmark hierarchy.
Event and Tracking Data: Delivered as line-separated JSON objects. Skeletal tracking is stored in a separate .jsonl file, with coordinates for each landmark and a visibility indicator. This modular design ensures efficient access for use cases that do not require full-body tracking.
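Because each line of a JSON Lines file is a complete JSON object, records can be processed one at a time without loading the whole file. The sketch below uses only the Python standard library; the field names frame_id and ball are hypothetical and do not reflect the actual CDF schema.

```python
import io
import json

def read_jsonl(stream):
    """Yield one parsed JSON object per non-empty line of a text stream."""
    for line in stream:
        line = line.strip()
        if line:
            yield json.loads(line)

# Hypothetical two-frame tracking sample; a real file would be opened with open(path).
sample = io.StringIO(
    '{"frame_id": 1, "ball": {"x": 0.0, "y": 0.0}}\n'
    '{"frame_id": 2, "ball": {"x": 1.25, "y": -0.5}}\n'
)
frames = list(read_jsonl(sample))
print(len(frames), frames[1]["ball"]["x"])  # 2 1.25
```

For tabular analysis, pandas.read_json with lines=True reads the same layout into a DataFrame.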
6 Conclusions
The proposed Common Data Format (CDF) offers a unified schema for five key football data types: match sheet data, video footage, event data, tracking data, and match meta data. Its design directly addresses major challenges in the current football data ecosystem: variability in what is collected, how it is defined, how it is structured, and how it is delivered.
By ensuring a uniform, unambiguous, and analysis-oriented structure, the CDF reduces technical friction for users, supports interoperability, and simplifies the integration of diverse datasets. While only a minimal set of required fields is enforced, the schema is designed to be flexible and expandable, allowing users to include optional fields or emerging metrics as standards evolve.
Although this is only the initial release, it provides a critical foundation for more reliable, reproducible, and scalable football analytics. Future updates will extend coverage as definitions mature and data technologies advance.
References
Anzer, G., Arnsmeyer, K., Bauer, P., Bekkers, J., Brefeld, U., Davis, J., ... & Van Haaren, J. (2025). Common Data Format (CDF): A Standardized Format for Match-Data in Football (Soccer). arXiv preprint arXiv:2505.15820. https://arxiv.org/abs/2505.15820
To keep this article concise, please refer to the original paper for the full list of references.






