I thought I’d start off with an overall review of the Alteryx ETL tool and why I chose to give it a try.
If I had to summarize it in one word (well …two) it would be Easy/Productive.
I’d tried several ETL tools in the past including Pentaho’s Kettle tool and of course Microsoft’s SSIS tool. Both of which I found to be very complicated for an inexperienced user. By the time I finally completed the installs, loaded all the necessary drivers, and started to setup my first workflow, I was already a bit disillusioned. Since most ETL tasks require loading multiple files of the same format, one of my first tests of any tool is to setup a simple routine to merge all the CSV files in a specific folder.
I found to my disgust that doing so in SSIS required configuring a project variable to store the file names, creating a “For Each Loop” control flow object, and then configuring a dynamic data flow package to read the file names into the variable and open each file! All that to do one of the most common ETL tasks out there.
In contrast, all you have to do in Alteryx, is drag and drop an input data node into your workflow, connect to one of your many files, and then replace the file name with an asterisk. Voila, done!
It’s that kind of simplicity and productivity that got me excited about using Alteryx. Before getting started in Alteryx, I did most of my ETL tasks directly in MS SQL. While it’s true you can do most things in SQL, I’ve found it way more productive to do most types of manipulation tasks in Alteryx due to a couple key tools that SQL (to my knowledge) does not support.
Multi-Row and Multi-Field Formulas
These handy tools deserve a whole separate post (and will get ones soon), but essentially they allow you to easily apply calculations across rows or columns. The multi-row function accomplishes similar things as a cursor in SQL, but is far easier. You can create calcs that scan up and down across rows which is really handy when dealing with dirty data like formatted excel reports with subtotal lines, etc. With Alteryx, you can dump all the data into your workflow and then use tools like this to copy values down to fill in blank rows, or apply complex positional-based calcs to parse through unstructured data.
The multi-field function lets you automatically apply the same calc across a whole set of fields. This saved me a ton of time when I needed to replace all nulls with zeros across my 50+ measure columns. Instead of having to laboriously create a bunch of IsNull statements for each column as I would have to do in SQL, I just created the function once using the multi-field tool and effortlessly applied it across all my measure columns in one fell swoop.
Intelligent Union Statements
Anyone who’s worked in SQL for a while has made the mistake of trying to union two sets of data together that either had different numbers of columns, or columns that weren’t in exactly the same order. The result is a bunch of errors, or even worse – incorrect data. In Alteryx, unioning data is super easy. It doesn’t matter what columns you have or what order they’re in. You can just let it automatically configure the union based on field names or position. This greatly improves productivity and accuracy when mashing a bunch of disparate sets of data together.
One of the final reasons I like Alteryx is because it’s visual interface results in what I call “automatic documentation”. This, in fact, is so important that I believe it’s one of the key reasons my company agreed to give this product a try. In any enterprise, there’s concern over how to design systems that can be easily transitioned and maintained as personnel come and go. In SQL, which lacks the visual interface, it’s not readily obvious what objects depend on or support other objects. One therefore has the incentive to keep the number of total objects (i.e. views, stored procedures, etc.) to a minimum to avoid an overly complex jumble of objects. This results in what I call “kitchen sink” queries that do everything from type conversion, unions, complex joins, aliasing, aggregating, and more. These work okay till something breaks, at which point you’re left sifting through many lines of confusing code trying to identify exactly where the problem begins and ends.
In alteryx, everything is different. Each node can only do one thing. At first this feels extremely limiting but in the end it results in a network of separate nodes that (though appearing complex at first) are remarkably easy to troubleshoot and/or revise as needed. It’s as if you’ve converted your database into a flow chart that is actually alive. When you hit the run button, you don’t just get an error at the end, you get a visual representation of the data flowing through the nodes and obvious warning symbols at any node where something goes wrong. And the best part is that this flow chart never gets outdated or needs to be revised, since it is the real thing, and not just a visual representation of reality.
This means I can open up an alteryx project that someone else built and immediately know the purpose and sequence of every step in their transformation. I can quickly see every input and output, and know exactly where errors are occurring.
Now I think that’s pretty cool!