Lesson learned: check the data

First day of Dashboard Week today, starting bright and early with a LOT of SQL data about football: 37 tables, one holding data about tournaments (43 records), one about matches (1,300), one about player lineups (16,500, one record per player), and most of the others about various events – fouls, cards, substitutions, even passes.

After a bit of thinking, my idea was to build a dashboard that coaches could use to analyse how the players on their team had played in previous matches against their upcoming opponents: whether they'd fouled more than average, scored highly, or just not had as much contact with the ball. I checked with our lead coach Carl, who suggested it could work, with a few modifications to my original design (most notably, making sure all the players were visible as a jitter chart, rather than one at a time as a bar chart, so they were easily comparable).

The first step, then, was to get the player lineup data, which I loaded into Tableau. I realised that with the data in that table going back many years, I'd need to work out a calculation for how long each player had played in total (the raw material was there, in the form of on/off timings – it'd be complex, as there were a multitude of cases and Tableau wasn't particularly happy with the time format, but it was workable). Once that was done, I started to bring in the event data, and it was there that my problems started.
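For illustration, here's a minimal SQL sketch of the kind of calculation involved – the table and column names are hypothetical, as is the assumption that a missing 'off' time means the player stayed on until the 90th minute (exactly the sort of special case that made the real thing fiddly):

```sql
-- Total each player's time on the pitch from on/off timings.
-- Hypothetical schema: player_lineups(player_id, match_id, time_on, time_off),
-- with timings stored as integer minutes. A NULL time_off is assumed
-- to mean the player stayed on until full time (90 minutes).
SELECT
    player_id,
    SUM(COALESCE(time_off, 90) - time_on) AS total_minutes_played
FROM player_lineups
GROUP BY player_id
ORDER BY total_minutes_played DESC;
```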

Two and a half hours into a task with a seven-hour limit, I saw that the lineup table I'd been working on covered leagues from the 2003/04 season to the 2021/22 one, whereas most of the other data was for one tournament – the 2022 Women's Euros – so what I'd spent the last hour-plus doing was entirely useless. What was worse, while there was a table of player lineups for that tournament, it only covered the starting lineups, and each record was for one team per match; to get it down to the player level, as my idea needed, would require a lot of processing – merging columns, deleting some columns, pivoting, and then splitting columns back up again.

Much as I'd wanted to use SQL for this, there was no way my skills were up to pivoting 11 columns to rows, so I had to load it into Tableau Prep to do that. I also found out, after doing that and moving on to the 'players off' and 'players on' tables to get fuller info about time on the pitch, that each of those only had 37 records; I've no idea why, but it made me realise that a large part of the data was either incomplete or, covering only one tournament, not wide-ranging enough for my uses. Checking through the tables individually confirmed this – there were 10 examples of 'bad behaviour', 6 examples of offside and 35 examples of 'shielding'. Complete or not, these numbers were far too low to analyse at the level of an individual player against a single country, so I had to abandon that plan (particularly since, with just one tournament, most pairings hadn't happened at all, and at most each country had played each other once).
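For the record, in SQL the unpivot could have looked something like this – a sketch only, with hypothetical table and column names (a starting_lineups table with one column per starting player), using PostgreSQL's LATERAL syntax:

```sql
-- Unpivot a one-row-per-team-per-match starting lineup into one row
-- per player. Hypothetical columns: player_1 .. player_11.
SELECT
    l.match_id,
    l.team_id,
    p.shirt_slot,
    p.player_name
FROM starting_lineups AS l
CROSS JOIN LATERAL (
    VALUES (1, l.player_1), (2, l.player_2), (3, l.player_3),
           (4, l.player_4), (5, l.player_5), (6, l.player_6),
           (7, l.player_7), (8, l.player_8), (9, l.player_9),
           (10, l.player_10), (11, l.player_11)
) AS p(shirt_slot, player_name);
```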

I moved to total occurrences of each type of incident, sticking with events that had happened over 1,000 times, as that seemed a number that should give enough room for analysis. I then realised there'd be further difficulties in presenting the chart as I wanted it unless all the event tables were unioned into one. This would be manageable, but again would take time – each event type (I could still use nine of them) had different columns, so each would need to be individually prepped, which, again, I felt far more confident doing in Tableau Prep, despite what I'd wanted. At this point, I had about two and a half hours left, with no relevant calculations, no charts, and certainly no dashboard.
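The union itself is simple in principle – the time sink is mapping each event table's different columns onto a shared schema. A rough SQL sketch, again with hypothetical table and column names:

```sql
-- Union differently-shaped event tables into one long table by
-- selecting each onto a common schema and tagging the event type.
SELECT match_id, player_id, minute, 'foul' AS event_type FROM fouls
UNION ALL
SELECT match_id, player_id, minute, 'card' AS event_type FROM cards
UNION ALL
SELECT match_id, player_id, minute, 'pass' AS event_type FROM passes
-- ...and so on for the remaining event tables.
```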

Eventually I managed to pull something together. The numbers didn't work properly (I didn't have time to work out the nested LODs I'd need), so I had to leave those out and just show the players relative to each other, but I did at least get the jitter plot working, with separate columns for each event, a parameter to choose which team to analyse and colours based on whether the event was a good thing or not. It was nothing compared to what I'd originally hoped for, and all because I didn't look through the data properly first to check what I'd be able to do. Lesson learned for next time, I guess.

Author:
Tim Jeffries