I’ve been busy on a bunch of big projects lately as you can probably tell by my neglected blog. Today I tried a new tool and thought I’d share how it works and how it can be improved on using a batch macro. My goal was to compute a bunch of correlation factors using Alteryx. Below is a sample of my data:
As you can see, each product has five weeks of data. I wanted to correlate each of my five factor columns to the Dollar Share column to see which factor was most predictive of the dollar share. The goal was to return the R Squared value of each correlation just as you could easily do in Excel using the RSQR() function. The R Squared function in Excel is built on the Pearson Product Moment Correlation function. I found the Pearson Correlation tool in the Data Investigation tab in Alteryx. Once that was dropped into the workflow, I selected which variables I wanted to compute a correlation.
The result is a matrix that looks like this:
Note how it is a symmetric matrix correlating each variable with all the other variables. Note also that this is the raw correlation coefficient (R), so to convert these numbers into an RSquared value, you simply have to square them. In my case, all I wanted as a result was the first row of data which shows all the factors correlated back to Dollar Share, so I was able to easily add a filter tool to exclude all the other results.
A this point I was getting excited about how easy this was until I realized that this correlation was based on my entire set of records which lumped both products together. It was therefore a correlation of all products together. My goal was instead to compute this correlation at the level of each product. Note in my sample above how I have two different products identified in Column A.
It would be “oh so handy” if the Pearson correlation tool in Alteryx had a GROUP By field or something by which to specify which groups of rows you wanted to perform the correlation, but since it does not I had to go one step further and convert this into a batch macro.
The purpose of the batch macro was to pass each product’s records into the Pearson Correlation function individually (5 rows per product) and stitch together the results by product rather than just passing the whole stream in at once. Below is a snapshot of the macro’s workflow.
This macro requires at least two inputs: Group and CorrelationTarget. Group is the field that is used to group the data into batches, and the CorrelationTarget is the variable that you want to correlate other variables to (in my case Dollar Share). All the macro itself actually does is perform the correlation and then append the Group name to the result.
The powerful part comes when this macro is called from another workflow:
In this example, I take the same sample dataset I showed at the beginning and feed it into the batch macro. The batch macro accepts two inputs. The upper one is the control field input, and the lower one is the actual data itself. Into the control field input I feed a unique list of product names which is the field I want to use to batch my data. Note how I use a summarize tool above to get a unique list of the two product names.
Into the data input I feed my original data after removing any fields that I don’t want correlated. The macro itself has two configuration tabs that must be setup. The one defines what GroupBy field I want to use. This is basically what tells the macro how to batch the data. In my case I feed in my unique list of product names which are in the GROUP column in my data.
The second tab allows me to map my fields. Since there are only two required fields, it’s pretty easy. The group field is mapped to my Group column, and the CorrelationTarget is mapped to the DOL Shr column which is the field I want to correlate everything to.
The net result after running this, is a separate set of correlations for each product. You can see for Product1 that Factor 3 has the highest R Squared value and is therefore most predictive, whereas for Product2, Factor 5 is most predictive.
Now that I have this macro built, I’ll be able to compute correlations for hundreds of thousands of items automatically which is exactly what I was trying to do.
I know this can be tough to follow, so I’ve attached the complete test package if you’d like to play with this macro and sample yourself. BatchRSquared_Macro Please comment with any questions or suggested improvements!
3 thoughts on “Calculating Correlations and R Squared Values with Alteryx”
Great work on this macro. Unfortunately I’m having difficulties getting it to work. Basically I have a data set including trans id, product category, items per trans, value per trans and avg item per trans. I’ve grouped them by product category and want to set value per trans as the correlation target. When I try to enter the correlation target in the macro then there are no alternatives (completely blank). All the fields, except trans id, is set as double (same as in ur example). Do you know where I’m going wrong? Would like to use it for another case later on if possible.
Appreciate any help I can get and thanks again for a great macro.
Would you be able to email me your alteryx package so I could test? You could dummy up the data if you’re concerned about privacy.
Based on the sample you sent I was able to diagnose the issue being that your measure fields that you wanted to correlate needed to be set to a numeric datatype. By default, all fields coming out of a csv file are formatted as text which obviously won’t work for mathematically calculations. Once I added a select statement in there and converted the data types to doubles, it worked fine.