TL;DR: Bins, index and densification can help twisting and wobbling lines. A Sankey is just a slope chart―only the lines are not straight.
This week was “Teaching week” at the Data School. Each of us taught about a topic that either interested us or that we needed some reinforcement with. Jeremy taught us about Sankeys. How to build them with a step-by-step guide and a couple of revelations. It was an eye-opener.
It took me a while, despite Jeremy’s best efforts, to grasp the concept of densification; how to use the bins to set the boundaries, and how to use Index() to account for the missing values between the boundaries. But, thanks to Jeremy’s examples ―and to some further digging― I understand it a little better now, and possibilities are just starting to emerge (case in point, let’s build a tool to plot a normal distribution graph).
Revelations:
- To build a Sankey, you need three calculations:
- The first one, to set the beginning.
- The second one, to set the end.
- The third one, to tell tableau how to connect the beginning and the end.
- Once you know your beginning and your end, you can connect them almost any way you want including, why not, the bell-shaped normal curve.
Normal Distribution
My goal: to build a simple normal distribution from its parameters –mean & standard deviation― as opposed to from actual data (which Robin Kennedy has shown already).
I know the beginning (0), and the end (0), and the formula to plot it:
What I need to produce the curve is to be able to apply the above formula along the x-axis, whose values should be more or less around the mean (to be able to capture the bell shape they should start from ~ 6 SD around the mean).
Densification
I start with a dataset that has a single row with one single (dummy) value.
Connecting in Tableau to that table, and unioning it to itself yields the following table:
I am going to use one row as the beginning of my curve, and the other as the ending:
- For the beginning, I will use:
- And for the ending, the same approach, but adding the standard deviations instead of subtracting them.
To assign them to the different rows, an IF statement over the unique identifier (in this case, “Table Name”)
So far, this looks like:
Bins
And now, the first revelation: creating bins makes Tableau self-aware of the gaps.
- Create bins of X size (~1/10 or 1/100 of the SD should work)
- Put the recently created bins on rows and show the minimum & maximum (based on the parameters created: average = 10, stdev = 1, bin size = 1, number of sd to show =6)
- Now, adding “INDEX()” (and making sure it is computing using the bins (“padded”)) to Measure Values:
- How can this be used to plot the curve of interest?
We have the equation for the curve, we have now a serialized sequence between the starting and the end points -aka the index. The only thing needed is to “re-scale” the index so that it is bounded and regularly spaced:
- Write the equation.
- Re-scale the index
Re-scaling the index:
Right now, the index goes from 1 to 13 (from the smallest to the largest bin). Ideally, it should follow the same numbers as the bins (“padded”). To re-scale it:
This takes the starting point (minimum) and grows gradually in each step until it reaches the latest bin (at the maximum). This is, effectively:
y = mx + c
where y
is the re-scaled index;
x
is t
(the renamed “Index”); c
= minimum and m
= (maximum – minimum)/number of bins
.
Adding the newly calculated field to the table:
The first and the last values are almost OK, just need to be adjusted by one “bin size”, but none of the intermediate steps are computing properly. Why? The reason is the equation of the recoded index relies on “minimum” and “maximum”, but those calculations only take place on the bins that have underlying data. To account for the missing values, the equation needs to be amended to use table calculations:
And, in the table:
Equation
Now the only thing pending is to plug in the equation for the normal distribution from above:
A word of caution: PI(), needed for the computation, is also a calculation. Which means it will not be accessible for those bins with missing data –it will only work for the boundaries. Table calculations, once again, to the rescue–and we can write the equation as:
And building the table:
When plotted:
Smoothing via reducing bin size and:
Re-cap:
- The underlying data is a table with two records (here a single row with a dummy value unioned to itself).
- Minimum and maximum values define the limits of the bins.
- A parameter defines the size of the bins.
- Bins make Tableau self-aware of missing points/
- That can be used via INDEX()
- And applied to whatever formula (in this case, the density function of the normal distribution).
- A working example:
Recommended readings:
I was only interested in using densification to “plot lines” and have covered only a few aspects about Index() and data densification. To know more:
- Data Densification Demystified by Damiana Spadafora
- When 576 = 567 = 528 = 456: Counting Marks on Densification by Jonathan Drummey. Must read. Great explanations.
- Filling Blanks via Multi-Pass Data Densification by Alexander Mou. Very clean example.