When first receiving a dataset, it can be tempting to dive straight into analysis. However, it is important to take a couple of steps back to get a good overview and understanding of what the data is measuring, and of what preparation is required before the data is analysed.
It is good to approach the data with a set of guiding questions, as this will likely improve the quality of the analysis and reduce the number of hurdles encountered when beginning analysis and visualisation.
So, what questions should you approach the data with?
What is the data that I / my team have been given?
This may seem like an obvious question, but it is an easy one to trip up on, especially if you are expecting a specific type of data and receive something fairly similar but not quite the same. For example, you may receive sales data aggregated by city when you were expecting data for each individual store. Or you may expect each entry to be an individual sales record, when in fact each row represents an individual customer.
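A quick way to test assumptions like these is to count rows against the keys you expect. The sketch below uses pandas and assumes a hypothetical file sales.csv with store_id and city columns; the names are purely illustrative.

```python
import pandas as pd

# Hypothetical file and column names, used purely for illustration.
sales = pd.read_csv("sales.csv")

# If the data is per store we would expect many distinct store IDs;
# if it has been aggregated by city, rows and cities will line up instead.
print("Rows:", len(sales))
print("Distinct stores:", sales["store_id"].nunique())
print("Distinct cities:", sales["city"].nunique())
```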
What does each individual row represent?
This links to the previous point, and is one of the first things that you should try to define as clearly as possible. A good approach is to identify which data field, or combination of fields, makes each row unique.
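In pandas this can be checked directly with a duplicate count over the candidate key. The field names below are assumptions for the sake of the example.

```python
import pandas as pd

sales = pd.read_csv("sales.csv")  # hypothetical file, as above

# Candidate key: the combination of fields we believe makes each row unique.
candidate_key = ["store_id", "product_id", "sale_date"]

duplicates = sales.duplicated(subset=candidate_key).sum()
if duplicates == 0:
    print("Each row is uniquely identified by", candidate_key)
else:
    print(duplicates, "rows share the same values for", candidate_key)
```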
What measures do we have present in the data, if any?
Are we looking at data which can be measured, and do the measures look as we expect them to? For example, we would expect stock quantities to be recorded as whole numbers and sales values to be recorded as positive numbers, unless there is a specific reason and meaning attributed to recording these values differently.
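These expectations translate into simple sanity checks. A sketch, again assuming hypothetical quantity and sales_value columns:

```python
import pandas as pd

sales = pd.read_csv("sales.csv")  # hypothetical file, as above

# Do the measures behave as we expect?
non_whole_quantities = (sales["quantity"] % 1 != 0).sum()
negative_sales = (sales["sales_value"] < 0).sum()

print("Quantities that are not whole numbers:", non_whole_quantities)
print("Negative sales values:", negative_sales)
```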
Are there date fields, and at what level are the dates recorded?
At first glance, we may find a data field with a full date and time record. However, this can be misleading and requires a closer look. Whilst a record may be formatted as '01/01/2023, 00:00:00', on examining other records we could find that every date is recorded as the first of a month in 2023, with a time of midnight. This could mean that the data table is updated on the first of each month at midnight, as the system automatically pulls through all the sales data for each store. It is still useful to have this in a date format, but it would mean that analysis can only be completed at the month or year level.
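One way to spot this is to check whether every timestamp sits at midnight on the first of a month. A sketch assuming a hypothetical sale_date column:

```python
import pandas as pd

sales = pd.read_csv("sales.csv", parse_dates=["sale_date"])  # hypothetical file

dates = sales["sale_date"]
all_midnight = (dates == dates.dt.normalize()).all()  # time component is 00:00:00
all_month_start = dates.dt.is_month_start.all()       # day component is the 1st

if all_midnight and all_month_start:
    print("Every timestamp is the first of a month at midnight; the data is effectively monthly.")
```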
Are there any hierarchies in the data?
This is especially useful for any geographical fields we may have, as the data may nest together and include different levels of detail, for example going from Country -> Region -> City -> Postcode. This is worth noting because certain visualisation tools can use such hierarchies to provide drill-down features for the data.
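A clean hierarchy means each lower level maps to exactly one parent, which can be checked with a groupby. The geographic column names below are assumptions:

```python
import pandas as pd

sales = pd.read_csv("sales.csv")  # hypothetical file with geographic columns

# In a clean hierarchy, each postcode belongs to exactly one city,
# and each city belongs to exactly one region.
cities_per_postcode = sales.groupby("postcode")["city"].nunique()
regions_per_city = sales.groupby("city")["region"].nunique()

print("Postcodes mapped to more than one city:", (cities_per_postcode > 1).sum())
print("Cities mapped to more than one region:", (regions_per_city > 1).sum())
```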
Do we have all the data we need? Is the data missing anything important?
It is important to consider not only what we have in the data, but also what else we may want in order to improve the quality of the analysis. Do we need to supplement the data with any additional measures? Is there additional detail we can add? This needs to be considered closely alongside the research question or user story.
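When supplementary data is available, it is usually joined onto the core dataset, and the join itself can reveal gaps. A sketch assuming a hypothetical lookup of sales targets per store:

```python
import pandas as pd

sales = pd.read_csv("sales.csv")            # hypothetical core dataset
targets = pd.read_csv("store_targets.csv")  # hypothetical supplementary measure

# Add the extra measure the research question calls for, e.g. a sales target per store.
enriched = sales.merge(targets, on="store_id", how="left")

# Rows with no matching target highlight gaps in the supplementary data.
print("Rows missing a target:", enriched["target"].isna().sum())
```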
These are just some of the key questions that will get you started when first interrogating a dataset. Ideally, looking at the data through the lens of these questions will surface additional questions and insights for further investigation.