As I mentioned in my last post, I’ve been testing several different cloud data warehouses. Though I considered AWS Redshift and Google Big Query, I concluded they don’t have the feature set or compatibility with other tools that we need since our systems are fairly complex and relational and not at the enormous petabyte scale that Redshift and Big Query are designed for. I settled on snowflake and Azure as my final two options. My first choice was to try Azure’s offerings since I assumed that would be the most painless transition from our old SQL server databases.
My conclusion is that Microsoft is heading in the right direction but they still have a long way to go before making a truly cloud-optimized platform. They currently offer Azure SQL Database as well as Azure SQL Data Warehouse. SQL Database is basically just a cloud version of traditional SQL server. I moved several databases up there to test it. The two main things I liked over on-premise SQL Server were that you can scale the performance up and down as needed and they have introduced some Artificial Intelligence based database optimization tools which are a big help. The scaling of the service was much easier than with an on-prem installation (obviously) but it was still quite complicated to automate requiring you to build a powershell script to scale it up and down on a regular basis over night, for example. You’d think Microsoft would just create a simple auto-scaling or scheduled-scaling GUI to do this.
The big downside of Azure SQL database was the performance/cost tradeoff. We have fairly optimized SQL servers here at TABS with a good amount of memory and raid-array drives. Our processes are very OLAP in nature involving a relatively small number of large transactions vs. many small concurrent transaction (OLTP). The performance difference between our traditional SQL server installed on a cloud VM and Azure SQL Database was HUGE! Even if we scaled all the way up to the really expensive service levels in Azure we could still barely get the performance we had on our own SQL server which left me quite disillusioned.
As a test, I created a simple script that duplicated a table several times resulting in a large amount of IO activity (a good benchmark for our type of OLAP activity). With Azure running at the Premium Tier 1000 DTU level, my script took 22 minutes to run. In contrast, it would finish in only 1.3 minutes on our traditional SQL Server. At this level of performance Azure would be costing us about $1500 per month running 10hrs/day during business hours which is much more than we pay today for SQL server. When I analyzed the performance metrics of the server, I saw that it was only using 40% of the DTUs and the bottleneck was writing to the log file.
While going through Microsoft documentation, I find that they clearly admit that Azure SQL Database won’t have as good IO performance. Here are their words:
“The key insight is that the IO capacity of a shared, commodity system is more limited than that of a dedicated server machine.”
My conclusion is that Azure SQL database can be good for small, transactional systems but is no match for our type of analytic workload. While it’s technically a cloud service, it’s really just 30 year old legacy software running in a shared cloud environment which brings with it all the limitations like log bottlenecks, index fragmentation, etc.
Next up – Azure SQL Data Warehouse
The next logical thing to try would be Azure SQL Data Warehouse since that is designed for high I/O workloads. In my tests, it did perform quite a bit faster than Azure SQL Database, but still was only on-par or slightly better than what we were getting in our normal SQL install. To be fair, I’m sure I didn’t test Azure data warehouse properly because to get the real benefit of the system, you really have to re-design your database to take advantage of it. Each table has to be assigned a distribution method since Data Warehouse is essentially just traditional SQL server sharded out across 60 nodes. The more I read, the more I became disillusioned. There’s a ton of complexity and nuance in how you tweak or distribution keys and columnerstore indexes, column statistics, etc. Contrary to my original hope, one cannot easily just migrate a SQL database into Azure SQL Data Warehouse. Though it’s the same underlying technology, there are enough changes that you have to re-design a lot. One simple example is that you can’t use join’s in update or delete statements and cross-database joins are not supported. That alone requires you to review all your code and re-engineer things.
Our goal was to go to a modern cloud-based system that is fast, simple, and cheaper without requiring a DBA with years of experience to get the benefit from it. Clearly that wasn’t going to come from Azure SQL Data Warehouse.
Snowflake is a new data warehouse that has been designed from scratch for the cloud. Unlike Azure, you pay almost nothing for storage and you only pay for the compute that you actually use. The system goes to sleep whenever you’re not actively querying it saving a ton of cost. I had initially considered it several years ago but didn’t think we could make the leap since it was missing several big features we rely on heavily like stored procedures and constraint enforcement.
I setup a trial account and started running tests. First I ran my benchmark script which duplicated a table to create a bunch of rows. In SQL Server, the script resulted in a table that was several gigs in size. In snowflake I thought I did something wrong at first because it finished in seconds and the resulting table was almost less than one megabyte! Then I realized snowflake’s smart compression was instantly compressing my duplicated table to nothing – pretty cool.
Next I tried some real-world data. I took 320 million rows of sales data which was about 17 GB when stored in a SQL table with page compression enabled. I did a few joins and wrote the result to a table. In our old SQL server it took 22 minutes (would have taken hours on Azure). In snowflake it finished in 6 minutes using the Extra Small compute level and the resulting table was only 10 GB. I got excited! Next I tried increasing our compute level. Each time I did, the query time was cut in half meaning that if I selected a high enough level I could get my test to run in seconds. And since you only pay for the minutes you use, it was the same total cost whether I ran it at 6 minutes on a low level or 1 minute at a high level. Finally I could get truly scalable performance in a database that “just works” no indexes, no log files, no execution plan hints, no backups, no maintenance plans, and almost no cost when I’m not actually using it. Though it still lacks a lot of features I take for granted and can be tough to develop in due to the lack of IDE tools and community knowledge, I’m confident that snowflake is on the right track and has the proper foundational architecture which is the most important thing. They continue to grow by leaps and bounds each year so I’m hopeful they’ll continue to round out the feature set.
Till next time – let it snow…let it snow…let it snow…