To The Point: Find and Check a Dataset's Granularity in Tableau Prep

Need to figure out which categories determine your dataset's granularity and you can't ask someone who knows the dataset already or the documentation does not explain it sufficiently? This post goes through my solution to this task using Tableau Prep.

A collection of categories (categorical data fields) determine a dataset's granularity if they can be used alone to uniquely identify a row in the full dataset.

These are the steps I follow to identify the granularity of a table within Tableau Prep:

0. Create a clean step for the table you want to look at.

  1. Identify the Category with Most Unique Values: Look for the categorical data field which has the most unique entries, i.e. the category which has the largest number displayed next to the header name. If it's not a text field but should be considered as a category (like US Area codes), hit "Detail" in the data field's menu.
  2. Check Most Repeated: Sort the category from step 1 from largest count to the smallest using the sort icon, and look at the highest count by hovering over the top value. If this says 1, you are done: this category alone determines the granularity.
  3. Filter: Right-click the top value (unless instructed otherwise in step 9) and click "Keep Only".
  4. Narrow Down Further: Repeat steps 1 to 3 until the next category with most unique values has the largest count of 1.
  5. Hypothesise the Granularity: The categories you've used to filter, plus the last category you examined in step 4, are likely the categories which determine the granularity. Make a note of these categories (if you have needed step 9, combine these categories with your previous noted categories).
  6. Check the Hypothesis: To check these categories do indeed set the granularity, add an aggregate step before your filtering. Add the categories identified and noted in step 5 into the Grouped Fields.
  7. Examine the Rows: Check the number of rows before and after the aggregation step by looking at the top left of the aggregation or cleaning step window - you may need to hover over the number to see an exact value. If the number of rows remains the same after the aggregation step, there were no repeated rows in the categories chosen in step 5 and thus these categories alone determine the granularity.

The steps above will be almost always enough to determine the granularity, but if you are unlucky the number of rows might have reduced after the aggregation step. This would mean that either there are duplicate records which could be an issue with the data source, or that the choice of value to filter by in step 2 removed important information. For the latter issue, we can fix this with adding the following step:

  1. If There are Less Rows: If the number of rows have lowered, add the data field "Source Row Number" as a measure in your aggregate step and change SUM to Count. Then add a clean step after the aggregation.
  2. Refine Choice in Step 2: In this clean step, sort the Source Row Number data field from Z to A, and filter by the highest number. This will only show you data from rows which have repeated in the selection of categories within the aggregation step. You will use one of these values only to determine your choice of filtered value in step 2 - you might need to use the search button to find this value when you return to step 2. Now return to the original (step 0) clean step, and start the process again but replacing the step 2 choices where possible using instead the values identified in this step (step 9).

Author:
Jeffrey Brian Thompson
Powered by The Information Lab
1st Floor, 25 Watling Street, London, EC4M 9BR
Subscribe
to our Newsletter
Get the lastest news about The Data School and application tips
Subscribe now
© 2024 The Information Lab