As I mentioned in my last post, I’ve been testing several different cloud data warehouses. Though I considered AWS Redshift and Google BigQuery, I concluded that neither has the feature set or tool compatibility we need: our systems are fairly complex and relational, not at the enormous petabyte scale that Redshift and BigQuery are designed for. That left Snowflake and Azure as my final two options. I tried Azure’s offerings first, since I assumed they would be the most painless transition from our old SQL Server databases.
My conclusion is that Microsoft is heading in the right direction, but it still has a long way to go before offering a truly cloud-optimized platform. It currently offers Azure SQL Database as well as Azure SQL Data Warehouse. SQL Database is essentially a cloud version of traditional SQL Server, and I moved several databases up there to test it. The two things I liked most over on-premises SQL Server were that you can scale performance up and down as needed, and that Microsoft has introduced some AI-based database optimization tools, which are a big help. Scaling the service was much easier than with an on-prem installation (obviously), but automating it was still quite complicated: you have to build a PowerShell script to scale it up and down on a schedule, for example overnight. You’d think Microsoft would just provide a simple auto-scaling or scheduled-scaling GUI for this.
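Because Azure has no built-in scheduled scaling, the script ends up encoding the schedule itself. Here is a minimal sketch of that logic in Python; the tier names and hours are hypothetical, and the actual scaling call (via PowerShell’s Set-AzSqlDatabase or a T-SQL ALTER DATABASE) is left as a comment rather than invented:

```python
from datetime import time

# Hypothetical schedule: a bigger service objective during business hours,
# a cheaper one overnight. Tiers and hours are assumptions, not Azure defaults.
BUSINESS_START = time(7, 0)
BUSINESS_END = time(19, 0)

def target_service_objective(now: time) -> str:
    """Pick the Azure SQL service objective for a given time of day."""
    if BUSINESS_START <= now < BUSINESS_END:
        return "S3"   # higher tier for daytime load
    return "S1"       # scale down overnight

# A scheduled job would pass the result to the real scaling step, e.g.
#   Set-AzSqlDatabase ... -RequestedServiceObjectiveName <tier>
# or T-SQL: ALTER DATABASE mydb MODIFY (SERVICE_OBJECTIVE = '<tier>');
print(target_service_objective(time(9, 30)))   # daytime
print(target_service_objective(time(23, 0)))   # overnight
```

The point is simply that the schedule lives in your own code, which is exactly the kind of plumbing a scheduled-scaling GUI would make unnecessary.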
It’s been almost exactly a year since I wrote my last post! Time flies when you have a busy job and family life. A TON has happened in the past year, and I’m excited to share some new developments. Last year at this time I had begun transitioning systems to Exasol. How has it gone? Well, a year later those same systems are still running just as fast and stably as when we first built them. Exasol has been great, yet surprisingly we have decided to move on to a different platform. Let me explain…
I’ve been pretty quiet lately for a good reason: I’ve been totally immersed in a new and exciting project, migrating many of our systems over to a new in-memory database system called Exasol. They claim to be the world’s fastest in-memory database, and so far they haven’t let me down. Complex data-shaping tasks that used to take 1.5 hours in SQL now complete in 7 minutes or less! In this post I’ll walk you through some of the reasons we chose Exasol and what my migration experience has been like.
At my company we often work with panel data and need to be able to easily group individual panelists into buckets or profiles. On a recent project I needed to identify which panelists were Light, Medium, Heavy or Extreme buyers within specific product categories based on how many times they purchased that category each year.
The criteria for what constitutes a given buyer group differ by category, so I couldn’t just use a simple mathematical formula to figure out which group each buyer falls into. Instead, I needed to compare my panelist data against a criteria file to identify each user’s group. Since this kind of task comes up often and can’t be solved with a simple join, I thought I’d write a quick post explaining how I solved it with Alteryx.
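To make the idea concrete before the Alteryx walkthrough, here is the same range-lookup logic sketched in Python. The categories and purchase-count cutoffs below are invented for illustration; in the real workflow they come from the criteria file:

```python
# Each category carries its own thresholds, so the buyer group is looked up
# in a criteria table rather than computed from one formula.
CRITERIA = {
    # category: [(min purchases per year, group)], sorted high to low
    "Snacks":    [(50, "Extreme"), (20, "Heavy"), (8, "Medium"), (0, "Light")],
    "Beverages": [(30, "Extreme"), (12, "Heavy"), (5, "Medium"), (0, "Light")],
}

def buyer_group(category: str, purchases: int) -> str:
    """Return the first group whose threshold the purchase count meets."""
    for threshold, group in CRITERIA[category]:
        if purchases >= threshold:
            return group
    raise ValueError("no matching group")

print(buyer_group("Snacks", 25))     # Heavy
print(buyer_group("Beverages", 4))   # Light
```

Note that this is a range comparison, not an exact match, which is why a plain equality join can’t express it.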
Below is a screenshot of my panelist data showing the transaction count for each user ID/Category combination.
I’ve been busy on a bunch of big projects lately, as you can probably tell from my neglected blog. Today I tried a new tool and thought I’d share how it works and how it can be improved with a batch macro. My goal was to compute a bunch of correlations using Alteryx. Below is a sample of my data:
As you can see, each product has five weeks of data. I wanted to correlate each of my five factor columns with the Dollar Share column to see which factor was most predictive of dollar share. The goal was to return the R-squared value of each correlation, just as you can easily do in Excel with the RSQ() function. Excel’s R-squared function is built on the Pearson product-moment correlation. I found the Pearson Correlation tool on the Data Investigation tab in Alteryx; once it was dropped into the workflow, I selected which variables I wanted to correlate.
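Since RSQ() is just the squared Pearson coefficient, the number Alteryx returns can be checked by hand. Here is a small Python sketch of the calculation; the five weeks of factor and share values are made up for illustration:

```python
from math import sqrt

def r_squared(xs, ys):
    """Squared Pearson product-moment correlation (what Excel's RSQ returns)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return (cov / (sx * sy)) ** 2

# Hypothetical five weeks for one product: one factor vs. dollar share.
factor = [1.0, 2.0, 3.0, 4.0, 5.0]
share  = [2.1, 3.9, 6.2, 7.8, 10.1]
print(round(r_squared(factor, share), 4))
```

Running this for each factor column against Dollar Share is exactly the comparison the Pearson Correlation tool performs, repeated per product.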
In my work at TABS Analytics, I often find myself building reusable Alteryx workflows for processing sales data from syndicated sources such as Nielsen or IRI. The goal is to build tools that can be applied to data coming from many different clients. Though the basic content of the data is the same, the field names and formats often differ by client: a basic measure such as sales dollars could arrive under many different titles (Dollars, Sales Dollars, DOLLARS, Value Sales, etc.). In this example I’m going to walk through how I use the Dynamic Rename tool to “iron out” these variations up front and avoid errors downstream.
Imagine that we have two different data sources with slightly different field names as shown below.
The goal is to build a workflow that can load both data sources, properly identify the fields, and process them without errors. The first thing we need to do is rename the fields so that they are identified consistently throughout the rest of the workflow. The standard Select tool won’t work here, since it throws an error if you try to map two different raw field names to the same output name (e.g. renaming both Dollars and Sales Dollars to just Dollars).
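Underneath, this is a many-to-one lookup from raw header to canonical name, which is easy to sketch in Python. The alias list below is illustrative, not exhaustive:

```python
# Map every raw header variant we've seen to one canonical name, then rename
# whatever subset of fields shows up in a given client's file.
ALIASES = {
    "dollars": "Dollars",
    "sales dollars": "Dollars",
    "value sales": "Dollars",
    "units": "Units",
    "unit sales": "Units",
}

def canonicalize(fields):
    """Replace known aliases with canonical names; pass unknown fields through."""
    return [ALIASES.get(f.strip().lower(), f) for f in fields]

print(canonicalize(["UPC", "DOLLARS", "Unit Sales"]))  # → ['UPC', 'Dollars', 'Units']
```

Because many aliases collapse to one output name, the lookup direction is alias-to-canonical, which is exactly the mapping the Dynamic Rename tool lets you drive from a lookup table instead of a hard-coded Select.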