Calculating Correlations and R Squared Values with Alteryx

I’ve been busy on a bunch of big projects lately as you can probably tell by my neglected blog.  Today I tried a new tool and thought I’d share how it works and how it can be improved on using a batch macro.  My goal was to compute a bunch of correlation factors using Alteryx.  Below is a sample of my data:

As you can see, each product has five weeks of data.  I wanted to correlate each of my five factor columns to the Dollar Share column to see which factor was most predictive of the dollar share.  The goal was to return the R Squared value of each correlation just as you could easily do in Excel using the RSQR() function.  The R Squared function in Excel is built on the Pearson Product Moment Correlation function.   I found the Pearson Correlation tool in the Data Investigation tab in Alteryx.  Once that was dropped into the workflow, I selected which variables I wanted to compute a correlation.

The result is a matrix that looks like this:

Note how it is a symmetric matrix correlating each variable with all the other variables.  Note also that this is the raw correlation coefficient (R), so to convert these numbers into an RSquared value, you simply have to square them.  In my case, all I wanted as a result was the first row of data which shows all the factors correlated back to Dollar Share, so I was able to easily add a filter tool to exclude all the other results.

A this point I was getting excited about how easy this was until I realized that this correlation was based on my entire set of records which lumped both products together.  It was therefore a correlation of all products together.  My goal was instead to compute this correlation at the level of each product.  Note in my sample above how I have two different products identified in Column A.

It would be “oh so handy” if the Pearson correlation tool in Alteryx had a GROUP By field or something by which to specify which groups of rows you wanted to perform the correlation, but since it does not I had to go one step further and convert this into a batch macro.

The purpose of the batch macro was to pass each product’s records into the Pearson Correlation function individually (5 rows per product) and stitch together the results by product rather than just passing the whole stream in at once.  Below is a snapshot of the macro’s workflow.

This macro requires at least two inputs:  Group and CorrelationTarget.  Group is the field that is used to group the data into batches, and the CorrelationTarget is the variable that you want to correlate other variables to (in my case Dollar Share).  All the macro itself actually does is perform the correlation and then append the Group name to the result.

The powerful part comes when this macro is called from another workflow:

In this example, I take the same sample dataset I showed at the beginning and feed it into the batch macro.  The batch macro accepts two inputs.  The upper one is the control field input, and the lower one is the actual data itself.  Into the control field input I feed a unique list of product names which is the field I want to use to batch my data.  Note how I use a summarize tool above to get a unique list of the two product names.

Into the data input I feed my original data after removing any fields that I don’t want correlated.  The macro itself has two configuration tabs that must be setup.  The one defines what GroupBy field I want to use.  This is basically what tells the macro how to batch the data.  In my case I feed in my unique list of product names which are in the GROUP column in my data.

The second tab allows me to map my fields.  Since there are only two required fields, it’s pretty easy.  The group field is mapped to my Group column, and the CorrelationTarget is mapped to the DOL Shr column which is the field I want to correlate everything to.

The net result after running this, is a separate set of correlations for each product.  You can  see for Product1 that Factor 3 has the highest R  Squared value and is therefore most predictive, whereas for Product2, Factor 5 is most predictive.

Now that I have this macro built, I’ll be able to compute correlations for hundreds of thousands of items automatically which is exactly what I was trying to do.

I know this can be tough to follow, so I’ve attached the complete test package if you’d like to play with this macro and sample yourself. BatchRSquared_Macro Please comment with any questions or suggested improvements!

This entry was posted in Alteryx and tagged , , . Bookmark the permalink.

3 Responses to "Calculating Correlations and R Squared Values with Alteryx"

Leave a Reply