Alteryx stats tools for beginners: part 1 - Clustering

by Andrew Lehm

Alteryx stats tools for beginners

This is a five-part blog series introducing the Alteryx stats tools. It is aimed at people new to statistical analysis and/or Alteryx, and explains what each tool does and when it is appropriate to use it.

  1. Clustering
  2. Data investigation
  3. Regression analysis
  4. Classification
  5. Time Series Analysis

 

Clustering

Clustering means taking a set of data points and segregating them into groups based on their properties. A great explanation of the process of clustering is given by Emily here.

With K-means clustering, the goal is to find groups within the data (the K here represents the number of groups) and assign each point to the most similar group.

The K centroids are the central points of the groups. The algorithm uses them to work out which group each data point is closest to.
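To make the idea concrete, here is a minimal sketch of plain K-means in Python. It is not what Alteryx runs under the hood, just an illustration of the assign-then-move loop. The function name `kmeans`, the sample points and the starting centroids are all made up for this example.

```python
import math

def kmeans(points, centroids, max_iter=100):
    """Plain K-means on a list of numeric tuples (illustrative sketch).

    `centroids` is the starting guess; its length is K.
    Returns (final_centroids, labels, iterations_used).
    """
    labels = []
    for iteration in range(1, max_iter + 1):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(len(centroids)), key=lambda k: math.dist(p, centroids[k]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for k in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                new_centroids.append(
                    tuple(sum(col) / len(members) for col in zip(*members))
                )
            else:
                new_centroids.append(centroids[k])  # empty cluster: leave it where it is
        if new_centroids == centroids:  # nothing moved, so we have converged
            return centroids, labels, iteration
        centroids = new_centroids
    return centroids, labels, max_iter

# Made-up data: two obvious groups, and K = 2 starting centroids.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
centroids, labels, iterations = kmeans(points, [(1, 1), (8, 8)])
print(labels)
print(centroids)
```

Each pass reassigns points to the nearest centroid and then moves every centroid to the middle of its group. The loop stops once the centroids no longer move.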

There are a number of interesting real-world use cases for clustering such as classifying literary data into categories, using past cases of fraud to detect new cases, location optimisation and profiling.

 

Let’s take a look at the Alteryx tools for clustering

The predictive grouping tools

 

 

K-centroids diagnostics

 

 

Use this tool to find out the appropriate number of clusters to create further down the line.

Select the fields you want to cluster on. If your variables differ widely in their range of values, select the option to standardise the fields. Then enter the minimum and maximum number of clusters you want to test, add a browse and run the tool.
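To see why standardising matters when fields have very different ranges, here is a small sketch assuming z-score standardisation (subtract the mean, divide by the standard deviation). The function name `standardise` and the income and age values are invented for illustration.

```python
import statistics

def standardise(values):
    """Rescale a column to mean 0 and standard deviation 1 (z-scores)."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample standard deviation
    return [(v - mean) / sd for v in values]

# Made-up columns on very different scales.
income = [20000, 35000, 50000, 65000, 80000]
age = [25, 30, 35, 40, 45]

z_income = standardise(income)
z_age = standardise(age)
print([round(z, 3) for z in z_income])
print([round(z, 3) for z in z_age])
```

Before standardising, any distance calculation would be dominated by income simply because its numbers are bigger. After standardising, both fields are on the same scale and contribute equally.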

 

 

The output will contain two tables of values and two corresponding charts containing box plots for each number of clusters within the range specified.

 

 

The adjusted Rand indices measure how consistently the same points end up grouped together when the clustering is repeated on resampled versions of the data. You want the mean of this measure to be as high as possible, indicating that the clusters are stable and reproducible.
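For the curious, the adjusted Rand index is calculated by comparing two cluster assignments of the same points. The sketch below uses the standard pair-counting formula. The function name and the example labels are made up, and how the tool chooses which pairs of assignments to compare is not shown here.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(labels_a, labels_b):
    """Agreement between two cluster assignments of the same points.

    1.0 means identical groupings (cluster numbering does not matter);
    values near 0 mean agreement no better than chance.
    """
    n = len(labels_a)
    # Count pairs of points that are together in both, in A, and in B.
    together_both = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    together_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    together_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = together_a * together_b / comb(n, 2)  # agreement expected by chance
    maximum = (together_a + together_b) / 2
    if maximum == expected:  # degenerate case, e.g. everything in one cluster
        return 1.0
    return (together_both - expected) / (maximum - expected)

# Same grouping with the cluster numbers swapped: perfect agreement.
same = adjusted_rand_index([0, 0, 0, 1, 1, 1], [1, 1, 1, 0, 0, 0])
# One point moved to the other cluster: agreement drops.
moved = adjusted_rand_index([0, 0, 0, 1, 1, 1], [0, 0, 1, 1, 1, 1])
print(same, round(moved, 3))
```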

The Calinski-Harabasz indices compare how separated the clusters are from one another with how tightly packed the points are within each cluster. You want the mean of this measure to be as high as possible as well.
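As a rough illustration, the Calinski-Harabasz index can be worked out by hand as the ratio of between-cluster spread to within-cluster spread, each scaled by its degrees of freedom. The function name and data below are made up, and the sketch assumes at least two clusters and more points than clusters.

```python
import math

def calinski_harabasz(points, labels):
    """Between-cluster dispersion over within-cluster dispersion (sketch)."""
    n = len(points)
    clusters = sorted(set(labels))
    k = len(clusters)
    overall = tuple(sum(col) / n for col in zip(*points))  # centre of all the data
    between = 0.0
    within = 0.0
    for cluster in clusters:
        members = [p for p, lab in zip(points, labels) if lab == cluster]
        centre = tuple(sum(col) / len(members) for col in zip(*members))
        # How far this cluster's centre sits from the overall centre.
        between += len(members) * math.dist(centre, overall) ** 2
        # How far this cluster's points sit from their own centre.
        within += sum(math.dist(p, centre) ** 2 for p in members)
    return (between / (k - 1)) / (within / (n - k))

# Made-up data: two tight, well-separated groups give a high score.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
good = calinski_harabasz(points, [0, 0, 0, 1, 1, 1])
# A poor grouping of the same points scores much lower.
poor = calinski_harabasz(points, [0, 1, 0, 1, 0, 1])
print(round(good, 1), round(poor, 3))
```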

 

 

The size of each box plot for these two indices gives an indication of how widely the index values are spread around their centre. Generally, a smaller spread is desirable.

If the adjusted Rand index is better for one number of clusters and the Calinski-Harabasz index is better for another, then pick a suitable compromise between the two. Personally, I favour Calinski-Harabasz because I want my clusters to be relatively distinct, even if the spread is a bit larger.

 

K-centroids cluster analysis

 

 

Use this tool once you have decided how many clusters you want to create.

Make sure you select the same fields you used in the diagnostics tool. If you standardised in the diagnostics tool, then also standardise in the analysis tool. Enter the number of clusters the diagnostics tool showed to be best, add browses and run.

 

 

The R output of the tool gives a summary report. The cluster information table provides, for each cluster, the number of points within it (size), the average distance between points in that cluster, the maximum distance between points in that cluster, and the distance from the centroid of that cluster to the nearest point of another cluster (separation).

 

 

Convergence after (X) iterations denotes how many times the centroids moved before finding the optimum position (see Emily’s blog above).

You want the average and max distance to be as low as possible and the separation to be as high as possible.
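Here is a hedged sketch of how a summary table like this could be computed by hand. It assumes the average and maximum distances are measured from each point to its own cluster's centroid, which is my assumption about the tool rather than something taken from its documentation. Separation follows the description above. All names and data are made up.

```python
import math

def cluster_summary(points, labels, centroids):
    """Size, average distance, max distance and separation per cluster.

    Assumes every cluster has at least one point and there are 2+ clusters.
    """
    summary = []
    for k, centre in enumerate(centroids):
        # Distances from this centroid to its own points and to everyone else's.
        own = [math.dist(p, centre) for p, lab in zip(points, labels) if lab == k]
        other = [math.dist(p, centre) for p, lab in zip(points, labels) if lab != k]
        summary.append({
            "size": len(own),
            "ave_distance": sum(own) / len(own),
            "max_distance": max(own),
            "separation": min(other),  # nearest point belonging to another cluster
        })
    return summary

# Made-up data and centroids for two clusters.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
labels = [0, 0, 0, 1, 1, 1]
centroids = [(4 / 3, 4 / 3), (25 / 3, 25 / 3)]
summary = cluster_summary(points, labels, centroids)
for row in summary:
    print({name: round(value, 3) for name, value in row.items()})
```

In this toy example the separation is far larger than the maximum distance for both clusters, which is the pattern you are hoping to see.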

 

Append clusters

 

 

Use this tool after you have run the cluster analysis tool to append the cluster number onto each row of data in your sample.

 

 

Attach the O output from the cluster analysis tool and the data you fed into the cluster analysis tool. No configuration is required other than to name the column if you wish (the default is cluster).
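Conceptually, appending clusters amounts to giving each row the number of its nearest centroid. A minimal sketch of that idea follows. The function name, the field names, the column name and the 1-based numbering are all assumptions for illustration, not the tool's actual internals.

```python
import math

def append_clusters(rows, fields, centroids, column="Cluster"):
    """Return copies of `rows` with the nearest centroid's number added."""
    result = []
    for row in rows:
        point = tuple(row[f] for f in fields)
        nearest = min(range(len(centroids)), key=lambda k: math.dist(point, centroids[k]))
        result.append({**row, column: nearest + 1})  # number clusters from 1 (assumed)
    return result

# Made-up rows and centroids.
rows = [
    {"id": "a", "x": 1, "y": 2},
    {"id": "b", "x": 9, "y": 8},
    {"id": "c", "x": 2, "y": 1},
]
centroids = [(4 / 3, 4 / 3), (25 / 3, 25 / 3)]
tagged = append_clusters(rows, ["x", "y"], centroids)
for row in tagged:
    print(row)
```

If you standardised the fields before clustering, the same standardisation would need to be applied to the rows before measuring distances.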

 

 

If you made it this far, I hope this guide has been useful for you and that you are able to think of possible use cases for this method. It is a relatively quick way to group data for further analysis.

Part 2 on the Data Investigation tools here