I noticed that there weren't many blogs explaining the statistical theory behind data, so in this one I'll try to explain some basic stats following my stats lesson this week.
Spread of data
When the data is skewed, i.e. not distributed symmetrically, the median is a better representation of the middle value than the mean. There are 3 types of skewness:
![](https://www.thedataschool.co.uk/content/images/2021/11/image-15.png)
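As a quick illustration of why the median holds up better, here is a minimal Python sketch using made-up, right-skewed data (the numbers are arbitrary):

```python
import numpy as np

# Hypothetical right-skewed data, e.g. incomes: most values are modest,
# a few are very large and drag the mean towards the tail.
rng = np.random.default_rng(42)
incomes = rng.lognormal(mean=10, sigma=1, size=1_000)

print(f"mean:   {np.mean(incomes):,.0f}")
print(f"median: {np.median(incomes):,.0f}")
# The mean lands well above the median, so the median is the
# better "typical" value for this skewed distribution.
```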
You can create a histogram in Tableau using the bins function to check the skewness. A histogram divides a continuous variable into classes (bins) of equal size, and the height of each bar shows how many data points fall into that class. If you want to reduce the skewness in your data you can change the scale to logarithmic, as it compresses the long tail and spreads out the smaller values, making the distribution look more symmetric. However, remember to mention in the title that you are using a log scale.
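If you want to sanity-check the skewness outside of Tableau, here is a rough sketch of the same idea in Python, assuming some made-up skewed sales figures:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
sales = rng.lognormal(mean=3, sigma=1, size=5_000)  # hypothetical skewed values

# Bin the continuous variable into classes, as a histogram does;
# each bar's height is the count of points falling into that bin.
counts, bin_edges = np.histogram(sales, bins=20)
print("points per bin:", counts)

print(f"skewness before log: {stats.skew(sales):.2f}")
print(f"skewness after log:  {stats.skew(np.log10(sales)):.2f}")
# Log-transforming compresses the long right tail, so the
# distribution looks far more symmetric on a log scale.
```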
In Tableau you can select the summary card via the show/hide cards option in the menu:
![](https://www.thedataschool.co.uk/content/images/2021/11/image-19.png)
to get a summary of your data: the average, median, max and min of your data values. By right-clicking on a mark you can also choose Explain Data, which gives you more information about that particular mark compared to the other values in view.
Skewness can also be observed in box plots:
![](https://www.thedataschool.co.uk/content/images/2021/11/image-16.png)
Box plots divide the data into four parts known as quartiles: 25% of the data is below Q1, 50% below Q2 (the median) and 75% below Q3. Apart from showing skewness, box plots can also flag outliers/extreme values using the formulae:
Q1 − (1.5 × IQR) for outliers/data points that are unusually small
Q3 + (1.5 × IQR) for outliers/data points that are unusually large
IQR = interquartile range = Q3 − Q1
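To make the fences concrete, here is a small sketch of the same calculation in Python (the data is invented, with two extreme points planted at the top end):

```python
import numpy as np

rng = np.random.default_rng(1)
data = np.concatenate([rng.normal(50, 10, 200), [120, 130]])  # two planted outliers

q1, q2, q3 = np.percentile(data, [25, 50, 75])
iqr = q3 - q1

lower_fence = q1 - 1.5 * iqr  # anything below this is flagged as a low outlier
upper_fence = q3 + 1.5 * iqr  # anything above this is flagged as a high outlier

outliers = data[(data < lower_fence) | (data > upper_fence)]
print(f"Q1={q1:.1f}, median={q2:.1f}, Q3={q3:.1f}, IQR={iqr:.1f}")
print("outliers:", outliers)
```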
Standard deviation and variance are good ways to measure the spread of the data around the mean. Standard deviation is more commonly used because, as the square root of the variance, it is in the same units as the data, so you can compare it directly to the data values and the mean. When calculating standard deviation you divide by n or n − 1 depending on whether your data is the entire population or just a sample. n − 1 compensates for the fact that you are not using all of the possible data points: the mean of a sample is only an estimate of the true population average, so to make up for that lost degree of freedom you divide by n − 1 when calculating the standard deviation of a sample.
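The n vs n − 1 distinction maps directly onto the ddof (delta degrees of freedom) argument in numpy, as this small sketch shows:

```python
import numpy as np

sample = np.array([4.1, 5.3, 6.0, 5.5, 4.8])  # made-up sample values

# Population formula divides by n (ddof=0); the sample formula
# divides by n - 1 (ddof=1) to correct for the lost degree of freedom.
print(f"population std (n):  {np.std(sample, ddof=0):.3f}")
print(f"sample std (n - 1):  {np.std(sample, ddof=1):.3f}")
# Variance is just the standard deviation squared:
print(f"sample variance:     {np.var(sample, ddof=1):.3f}")
```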
Hypothesis Testing
A useful procedure in stats is hypothesis testing, where you test the legitimacy of a statement by analysing the data, for example after completing a survey, to see if your result is actually statistically significant. Usually you have a null hypothesis and an alternate one; both are assumptions you make about the data and they are direct opposites of each other, e.g. employment is on the rise (alternate) vs employment is not on the rise (null). A p-value can then be calculated: the probability of seeing a result at least as extreme as yours if the null hypothesis were true.
![](https://www.thedataschool.co.uk/content/images/2021/11/image-17.png)
Depending on the skew of your data and the number of values in your sample, the critical value varies; a significance level of 0.05 (5%) is the usual convention. If your p-value is higher than this threshold then your result is not statistically significant and you fail to reject the null hypothesis, which warrants further investigation as you need to find out what is actually going on in the data. Referring back to my previous example: if employment is not on the rise, is it going down? That is another hypothesis and must then be tested.
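To tie the pieces together, here is a minimal sketch of the employment example as a one-sample t-test in Python; the figures are entirely made up:

```python
import numpy as np
from scipy import stats

# Hypothetical month-on-month changes in employment (in thousands).
changes = np.array([1.2, -0.3, 0.8, 1.5, 0.2, -0.1, 0.9, 1.1])

# Null hypothesis: the true mean change is 0 (employment is not rising).
t_stat, p_value = stats.ttest_1samp(changes, popmean=0)

alpha = 0.05  # the conventional significance level
print(f"p-value = {p_value:.4f}")
if p_value < alpha:
    print("Statistically significant: reject the null hypothesis.")
else:
    print("Not significant: fail to reject the null and investigate further.")
```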
Alteryx
In Alteryx there are Data Investigation and Predictive tool palettes which can be used to analyse your data. An association analysis can be done to see how closely related fields are to each other. The value produced is either a Spearman's or a Pearson's correlation. They measure similar things, but Pearson's assumes the relationship is linear (the points sit close to a straight line), while Spearman's works on the ranks of the values, so it picks up any monotonic relationship and is less affected by outliers. If the value is close to 1 there is a strong positive correlation between the fields being tested; close to 0, they are most likely unrelated; close to −1, they oppose each other (as one increases the other decreases). This relationship is usually best presented as a scatter plot in Tableau.
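The difference between the two coefficients is easiest to see on data with a monotonic but non-linear relationship; a sketch with invented data:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
x = rng.uniform(0, 10, 100)
y = np.exp(x / 3) + rng.normal(0, 1, 100)  # always increasing, but curved

# Pearson measures linear association; Spearman ranks the values
# first, so it picks up any monotonic relationship.
pearson_r, _ = stats.pearsonr(x, y)
spearman_r, _ = stats.spearmanr(x, y)

print(f"Pearson's r:    {pearson_r:.2f}")   # lower: the points don't sit on a line
print(f"Spearman's rho: {spearman_r:.2f}")  # near 1: the trend is monotonic
```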
Data Investigation Pane
In the Spearman Correlation tool there is a Group By option which lets you group your data so you can see the correlation of the test fields within each group. For example there may be a relationship between selling price and dealer pay for petrol cars but not for diesel ones, so it is important to split the data into groups to reveal this.
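Outside Alteryx, the same grouped correlation can be sketched in pandas; the column names here (fuel, sell_price, dealer_pay) are invented for illustration:

```python
import pandas as pd

# Hypothetical car sales data.
df = pd.DataFrame({
    "fuel":       ["petrol"] * 4 + ["diesel"] * 4,
    "sell_price": [10, 12, 14, 16, 20, 18, 22, 19],
    "dealer_pay": [1.0, 1.2, 1.4, 1.6, 2.1, 1.3, 1.8, 1.1],
})

# Spearman correlation of the two test fields within each group.
by_fuel = df.groupby("fuel").apply(
    lambda g: g["sell_price"].corr(g["dealer_pay"], method="spearman")
)
print(by_fuel)  # petrol may show a strong link while diesel does not
```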
A Distribution Analysis tool can be used to see the skewness of your data. It also gives you values for the mean and standard deviation.
Predictive Pane
A t-test can also be calculated to see how similar two data sets are to each other, via the Test of Means tool.
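For reference, a two-sample t-test comparing two made-up data sets looks like this in Python (a rough equivalent of what the tool computes, not the tool itself):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
group_a = rng.normal(100, 15, 50)  # hypothetical measurements
group_b = rng.normal(105, 15, 50)

# Two-sample t-test: are the two means plausibly the same?
t_stat, p_value = stats.ttest_ind(group_a, group_b)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
# A small p-value suggests the two data sets have different means.
```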
If you notice from your correlation coefficient that your fields are closely related (Spearman's/Pearson's close to 1), you can use a Linear Regression tool to predict data points. This fits your data to the y = mx + c relationship and lets you test whether a linear model is a good fit. The tool produces an R² value, and the output also shows which fields are the most significant, i.e. contribute the most to the overall R² value. If you notice that certain fields are less significant you can remove them from the tool's configuration, which simplifies the model but will slightly lower your R² value, so it's your call whether to have a more complex but more accurate model or a simpler, less accurate one.
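As a rough Python equivalent (a sketch with synthetic data, not the Alteryx tool's exact output), statsmodels reports both the R² value and a p-value per field, which is how you would spot the less significant ones:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(4)
n = 200
x1 = rng.normal(size=n)  # a strongly related field
x2 = rng.normal(size=n)  # a weakly related field
y = 3.0 * x1 + 0.1 * x2 + rng.normal(scale=0.5, size=n)

X = sm.add_constant(np.column_stack([x1, x2]))  # adds the intercept, the "c" in y = mx + c
model = sm.OLS(y, X).fit()

print(f"R-squared: {model.rsquared:.3f}")
print("p-values:", model.pvalues)  # low p-values flag the most significant fields
```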
A Score tool can then be attached to test how good your predictive model is. Even if your Spearman's/Pearson's coefficient is close to 1, a linear regression model may not be the best predictive model for the data. The Score tool measures how close the actual values are to the predicted values and doesn't require any configuration. A Model Comparison tool can also be used for the same job.
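Scoring boils down to comparing predictions against actual values on data the model hasn't seen; a minimal sketch of that idea with scikit-learn and synthetic data:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(5)
X = rng.normal(size=(200, 2))
y = 3 * X[:, 0] + rng.normal(scale=0.5, size=200)

# Hold back some rows so the model is scored on points it hasn't seen.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LinearRegression().fit(X_train, y_train)
preds = model.predict(X_test)

# How close are the predicted values to the actual ones?
print(f"R^2: {r2_score(y_test, preds):.3f}")
print(f"MAE: {mean_absolute_error(y_test, preds):.3f}")
```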
Many of the predictive and data investigation tools are macros with a lot of tools running in the background, so be aware that workflows containing them will take longer to run.
I hope this blog has helped to explain some basic stats and shown how the discipline can be applied to Tableau and Alteryx. Thanks for reading if you got this far! :)