Fuzzy Joins for Grouping Values Into Buckets

At my company we often work with panel data and need to be able to easily group individual panelists into buckets or profiles.  On a recent project I needed to identify which panelists were Light, Medium, Heavy or Extreme buyers within specific product categories based on how many times they purchased that category each year.

The criteria for what constitutes a certain buyer group was different for each category so I couldn’t just use a simple mathematical formula to figure out which group each buyer falls into.  Instead I needed to compare my panelist data with a criteria file to identify which group a user falls into.  Since this kind of task is commonly encountered and cannot be solved via a simple join, I thought I’d write a quick post explaining how I solved this issue with Alteryx.

Below is a screenshot of my panelist data showing the transaction count for each user ID/Category combination.

Here’s a copy of the criteria file indicating the min and max purchase count that constitute each group or bucket.  For the Beverage category, for example, someone must purchase at least 1-6 times per year to be considered a LIGHT buyer.  Note the extreme buyers have no Max purchase specified since there is no upper limit to this group.

As you can see, there’s no way to easily join these datasets based on transaction count or something like that.  Instead, I had to do a fuzzy type of join which identifies users based on a range of values.

The first step was to joined these two datasets on category alone which is not unique in either dataset.  This results in each row of the panel data being replicated five times (once for ever occurrence of that value in the criteria file as you can see here.  Note how the transaction count is replicated on each row.

Next I used a formula tool to evaluate if the condition was met on each line.  If the MaxPurchase column is null then it merely checks the lower limit, otherwise it checks the upper and lower limit conditions.  The result is a 1/0 flag indicating if the condition is met.

Since none of the conditions overlap, and since one condition exists for all possible outcomes, the result is that one of the five rows for each buyer is flagged with a 1 while all the rest get a zero.

The final step is to filter just on the rows that are flagged as 1(true) which gets me back to a unique list of buyers and categories as I originally started with except that now I have the Group field identified for each user.

This process is very scalable and can be used to apply thousands of different criteria within minutes.  It could also be setup to handle scenarios where users fall into more than one group.

Here’s the final Alteryx workflow example:

Let me know in the comments if you have any questions.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s