A smart way to investigate your data in Alteryx

The project of this week was really nice, I had the chance to learn more about Alteryx and its tools. I had to prepare one hour training module in Alteryx about “CREATING DISTRIBUTIONS”.

So… I started investigating about this topic and I ended up structuring the training session into 3 parts:

  1. Data investigation tools – FIRST INVESTIGATION
  2. Predictive Grouping – CREATE CLUSTERS (Benedetta explained already in this blog post)
  3. Data investigation tools – INVESTIGATE YOUR RESULTS

But what are these data investigation tools? Actually, I never heared about them before this week and this is one of the reasons I really love Alteryx: every time I need to so something really specific I end up discovering new tools and features and it is amazing!

These tools are based on R and they are really useful for a first data investigation. Before a process or analysis takes place, users need to understand the details of their data.

Let’s look into some of these tools together:

data_investigation_1

  • Association Analysis: this tool generates a correlation matrix report (one static and one interactive), determining which fields in a database have a bivariate association with one another. It is possible to determine a target field (if there is a particular field of interest). I would suggest to investigate before a wider range of data in order to get a general overview of the correlation. In the example below, I just deselected Latitude and Longitude and use the Pearson product-moment correlation:

Association Analysis

Here the interactive report:

matrix1

If you click on one of the small square, it will update a scatter plot on the right side of the report as below:

scatter_plot

Red means strong positive linear correlation (~ y increases while x increases) while blue means strong negative linear correlation (~ y decreases while x increases).

  • Distribution Analysis: this tool allows to fit one or more distributions to the input data and compare them. It is based on the statistical significance (p-values). I selected the Mean Age field and all the distributions available, in order to see if at least one of them fits it or not:

distribution_analysis

Below the report:

distributionchi_square

If we look at the Chi-Square test, it is possible to deduct that Normal, Lognormal and Gamma quite fit the distribution.

  • Histogram: shows the empirical cumulative distribution of a single numeric field by showing the frequencies of records falling in a set of continuous value ranges. I selected the field Sum_Residents and break the histogram into 10 parts (it is possible to leave it also “auto”) and I plotted a smoothed density curve:

histogram

Below the view:

histo2

All these information can be processed in order to perform predictive analysis, such as predictive grouping (clustering). Below my workflow:

wf clustering

Below the settings of the K-Centroids Cluster Analysis (I selected all the fields apart from Latitude and Longitude, divide it into 4 clusters – do not forget to check the report and to use the Append Cluster tool):

clustering2

After a process or analysis takes place, users need to investigate the results. I found some of the data investigation tools very useful.

I put as input data the output of the previous predictive grouping workflow. I used a select tool in order to transform the field Cluster into a string (it was a double before). Why? To use the data investigation tools on it!

investigations2

  • Plot of means: the tool takes a numeric or binary categorical field as a response field along with a categorical field and plots the mean of the response field for each of the categories of the categorical field. I selected Mean_Age as response field and Cluster as categorical field to group by as below:

plot of means

It is possible to observe how the age fall into the 4 different clusters:

plot of means 2

  • Frequency table: the tool helps the user understand the contents of their data. For each field selected by the user, a frequency table is produced. The frequency table is a summary of their data with frequency counts and percentages for each value in a field. In my case, I want to look at Clusters:

frequency table1

Below the interactive report:

cluster

I hope you enjoyed it.

A presto!

 

Author:
Simona Loffredo
Powered by The Information Lab
1st Floor, 25 Watling Street, London, EC4M 9BR
Subscribe
to our Newsletter
Get the lastest news about The Data School and application tips
Subscribe now
© 2024 The Information Lab