Snowflake vs Redshift – 5 Important Factors to Consider
In an increasingly competitive environment, businesses must dig deep into the data to find value and leverage insights to drive decision making. Data Warehouses are a critical component of this endeavour. Data warehouses have been the lifeblood for enterprises for many decades now. The rise of cloud data warehouses has brought this capability within reach of every business that is serious about leveraging their data for growth. Amazon Redshift and Snowflake are two such cloud data warehouses that have significantly simplified the adoption of a data warehouse.
When we talk about best-in-class modern cloud data warehouse solutions, Snowflake and Amazon Redshift instantly come to mind. Both are highly ranked by Gartner and offer excellent capabilities for any company planning to kick-start its analytics practice. Choosing between them is not merely a question of which solution is better, but of which solution is a better fit for your business and your requirements. Both are highly capable and offer a range of features to help you store and process data. Our article on Snowflake vs Redshift gives you a quick insight into the key differences between the two platforms.
Snowflake vs Redshift- Architecture
Amazon Redshift is one of the first cloud data warehouses, and it revolutionized and democratized cloud data warehousing. It is based on PostgreSQL and offers a SQL-compliant query language for analysts and developers to interact with the data warehouse. Technology is changing at a more rapid pace than at any time in the past. Cloud data warehouse technology is not immune to this development, and the Snowflake data warehouse is an example of such progress.
There are fundamental differences in architecture between Amazon Redshift and the Snowflake data warehouse. Understanding these differences is key to understanding the capabilities of, and the differences between, these two platforms. The most fundamental difference between the two solutions comes down to the separation of storage and compute.
Let’s look at what we mean by the separation between storage and compute and why it is essential. If you’d like to skip this foundational topic and jump right into understanding the difference between the two solutions, skip to the next section. You can always come back and read this section later.
What is the architecture of a data warehouse?
Any data warehouse essentially comprises four hardware components: compute, memory, storage, and the network interconnect. Traditionally, these four components have been tightly coupled. Compute and memory demands on the data warehouse vary based on the workload. A traditional data warehouse sees infrequent data write activity and more frequent read activity. Compute requirements on the data warehouse can be classified into two types:
- Compute to Ingest data: Processing capacity needed to ingest data into storage.
- Compute to read/process data: Processing capacity required to process data retrieved from storage and memory.
Storage demands on the Data warehouse are also dependent on the type of data that is stored. The simple classification of data can be as follows:
- Cold – rarely queried or updated data
- Warm – infrequently queried or updated data
- Hot – frequently queried or updated data
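This temperature classification can be sketched as a simple rule on access recency. The sketch below is purely illustrative; the 30-day and 180-day thresholds are assumptions for the example, not a standard.

```python
from datetime import date

def temperature(last_accessed: date, today: date) -> str:
    """Classify data by how recently it was queried or updated.
    The 30/180-day thresholds are illustrative, not a standard."""
    age_days = (today - last_accessed).days
    if age_days <= 30:
        return "hot"    # frequently accessed, keep on fast storage
    if age_days <= 180:
        return "warm"   # infrequently accessed
    return "cold"       # rarely accessed, candidate for cheap storage

today = date(2024, 6, 1)
print(temperature(date(2024, 5, 25), today))  # hot
print(temperature(date(2024, 2, 1), today))   # warm
print(temperature(date(2022, 1, 1), today))   # cold
```

In practice, a warehouse (or its administrator) uses rules like this to decide which data belongs on expensive fast storage and which can move to cheaper tiers.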
A company may have been in business for over a decade, but data for the last few months or weeks is usually sufficient for day-to-day reporting and analysis. Anyone who uses a computer or a phone understands that the cost of storage varies based on the type of storage used. Take the iPhone as an example: the price of an iPhone with 64GB of storage and one with 128GB of storage is significantly different, while the cost of iCloud storage is a fraction of what you pay for storage on the phone itself. You would use iCloud storage to hold the old data (cold and warm data) that you no longer access regularly, while the information you access frequently (hot data) resides on your phone. Doing this is not only cost-efficient, it also ensures that the data you need is always available. Once you set up your iCloud account, iOS manages the movement of data from your phone to iCloud seamlessly, without your involvement. Now imagine a scenario where iCloud storage doesn’t exist. What do you do?
You have two options:
- Increase the storage on your phone – which is very expensive or
- Delete some of the old files that you no longer need – You lose precious pictures and more
Let’s assume for the time being that you have all of your photos, emails, messages, and everything else residing on your phone. You want to look at a recent picture that you took with your friend.
- How much time do you think you’ll take to retrieve the picture?
- Have you seen your phone slow down over time? Why do you think that happens?
- Wouldn’t it be nice if the picture instantly pops up on the screen?
The time taken by the processor to process data is directly proportional to the volume of information it processes. If you have a large volume of data stored on your phone, retrieving a picture will take a while, because your CPU has to process large volumes of data stored on your phone storage before it can find the image you want to look at and render it on your screen. If you are playing a game on your phone, you will notice a lag if the game requires a lot of processing power.
- Wouldn’t it be nice if the processing capacity of your phone increased automatically based on the needs of your request?
- Wouldn’t it be nice if you could pay the same price for your on-device storage as you do for iCloud?
Unfortunately, you cannot with the current design of the iPhone. You would have to buy a phone with higher processing power and more storage, leading to higher costs, because you must buy a phone that caters to your peak usage patterns.
Higher prices and greater management overhead are the result of the tight coupling of compute and storage.
Traditional data warehouses suffer from the same limitations as the iPhone in the example above, and Amazon Redshift inherits some of them. In Amazon Redshift, storage and compute cannot be scaled independently in a manner that is seamless to the user. Over the years, AWS engineers have made significant improvements that allow Redshift to scale up and down more smoothly in response to usage. Yet compute and storage still cannot be scaled independently, because Redshift relies on an architecture that is at least a decade old.
The Snowflake data warehouse achieves true compute and storage elasticity and thereby addresses the challenges listed in the iPhone example above. Snowflake leverages inexpensive object storage such as Amazon S3 or Google Cloud Storage (the iCloud of our analogy) to store your data. It relies on cloud virtual machines (such as AWS EC2 instances) and intelligent compute management to make additional compute nodes available based on the demands of your processing request, whether it is an ingest request or a read request. Achieving independent storage and compute elasticity is one of the prime reasons that propelled Snowflake to become one of the fastest-growing cloud technologies in the market today.
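The idea of elastic compute can be sketched as a simple scaling policy: provision compute nodes in proportion to pending work, and release them all when the warehouse is idle, while storage stays constant and is billed separately. This is an illustrative sketch only; it is not Snowflake's actual algorithm, and the thresholds and node counts are made-up numbers.

```python
# Illustrative sketch of elastic compute (NOT Snowflake's real algorithm):
# compute scales with the query queue and drops to zero when idle, while
# storage is decoupled and unaffected by how many nodes are running.

def nodes_needed(pending_queries: int, queries_per_node: int = 8) -> int:
    """Scale compute with the query queue; scale to zero when idle."""
    if pending_queries == 0:
        return 0  # no compute running (and no compute cost) while idle
    # Ceiling division: enough nodes to cover the whole queue.
    return -(-pending_queries // queries_per_node)

# Storage is independent: it is the same whether 0 or 100 nodes run.
storage_tb = 50

for queue in [0, 3, 8, 25]:
    print(f"queue={queue:>3} -> nodes={nodes_needed(queue)}, storage={storage_tb} TB")
```

The key point of the sketch is the zero case: with tightly coupled architectures, you pay for peak capacity around the clock; with decoupled compute, capacity (and cost) can follow demand.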
However, price and manageability are not the only reasons to select an enterprise data warehouse. Let us investigate the critical differences between these two solutions.
Price – Snowflake vs Redshift
The cost to operate a data warehouse is not limited to the cost of technology. It also includes the price to manage and use the data warehouse. You can read more in-depth articles on Snowflake pricing and Amazon Redshift pricing in our blog.
AWS charges you for a pre-defined configuration of compute and storage used in a Redshift cluster. Once you spin up a Redshift cluster, you get charged regardless of whether the cluster is in use or not. If you have a predictable workload, then committing to a pre-defined Redshift cluster may offer you a better price in comparison to Snowflake.
Snowflake charges separately for compute and storage. Because Snowflake achieves true compute and storage elasticity, it can offer attractive pricing models to customers. For example, if you are not processing any data, you are not charged for compute; you are still charged for the storage used to hold your data. If you use compute for only four hours a day, you are billed only for those four hours. Snowflake intelligently manages the availability of compute and ensures that processing capacity is always available when you need it. During the periods when you are not using compute, Snowflake can allocate the processing resources earmarked for you to other customers. By doing so, it achieves a more comprehensive optimization of hardware resources than Amazon Redshift.
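To make the billing difference concrete, here is a back-of-the-envelope comparison of an always-on cluster billed per hour versus pay-per-use compute plus separately billed storage. All of the prices below are hypothetical placeholders for illustration, not current list prices for either product.

```python
# Back-of-the-envelope monthly cost comparison. All prices are
# HYPOTHETICAL placeholders -- check the current Redshift and Snowflake
# pricing pages for real numbers.

HOURS_PER_MONTH = 730

def always_on_cost(price_per_hour: float) -> float:
    """Redshift-style billing: the cluster bills every hour, used or not."""
    return price_per_hour * HOURS_PER_MONTH

def pay_per_use_cost(compute_price_per_hour: float,
                     active_hours_per_day: float,
                     storage_tb: float,
                     storage_price_per_tb_month: float) -> float:
    """Snowflake-style billing: compute bills only while running;
    storage bills separately at object-storage rates."""
    compute = compute_price_per_hour * active_hours_per_day * 30
    storage = storage_tb * storage_price_per_tb_month
    return compute + storage

always_on = always_on_cost(price_per_hour=2.0)
per_use = pay_per_use_cost(compute_price_per_hour=2.0,
                           active_hours_per_day=4,   # busy 4 hours/day
                           storage_tb=1,
                           storage_price_per_tb_month=25)
print(f"always-on: ${always_on:,.0f}/mo, pay-per-use: ${per_use:,.0f}/mo")
```

Under these assumed numbers the always-on cluster costs several times more for a 4-hours-a-day workload; conversely, a cluster that is busy around the clock narrows or eliminates that gap, which is why predictable 24/7 workloads can favor a committed Redshift cluster.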
Database Features – Snowflake vs Redshift
| Feature | Snowflake | Redshift |
|---|---|---|
| Automated data encryption | Yes | Yes |
| Support multiple workloads simultaneously | Yes – you can start separate virtual warehouses for different types of workloads | Yes – but since the same cluster serves both reads and writes, workloads contend for resources |
| Separation of storage and compute | Yes | Tightly coupled. Recommended to use Redshift only for hot data, and Redshift Spectrum or Athena for cold and warm data, to achieve cost savings |
| SQL and analytic functions | Yes | Yes |
| JSON, AVRO, and ORC support (i.e., semi-structured data types) | Yes – Snowflake supports all these file formats effectively | Native JSON support only, and processing JSON in Redshift is slow |
| Fast clone of tables | Yes | No |
| Persistent result sets | Yes | No |
| String length | Effectively unlimited | Limited to 65,535 characters |
| Data sharing | Yes, and this is a HUGE differentiator. With Snowflake, data providers can share data with consumers without having to replicate it and use up more storage. They simply expose the data and let the consumers pay for processing it | No |
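To see why native semi-structured support matters, consider the work a warehouse must otherwise do: nested JSON records have to be flattened into columns before they can be queried relationally. Here is a rough Python sketch of that flattening step; the event payload and field names are made up for illustration.

```python
import json

def flatten(record: dict, prefix: str = "") -> dict:
    """Flatten nested JSON into dotted column names, the kind of tabular
    shape a relational engine can query directly."""
    out = {}
    for key, value in record.items():
        column = f"{prefix}{key}"
        if isinstance(value, dict):
            # Recurse into nested objects, extending the column name.
            out.update(flatten(value, prefix=column + "."))
        else:
            out[column] = value
    return out

# A hypothetical event payload, as it might arrive from an application.
event = json.loads('{"user": {"id": 42, "plan": "pro"}, "action": "login"}')
print(flatten(event))  # {'user.id': 42, 'user.plan': 'pro', 'action': 'login'}
```

A warehouse with first-class semi-structured types performs this kind of columnarization internally and lets you query nested fields directly, instead of forcing you to pre-flatten data in an external pipeline.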
Snowflake vs Redshift: Integrations
A typical business these days leverages upwards of 20 different applications or data sources to operate, and applications are replaced at a faster clip than ever. Building a data pipeline in-house might sound like a good idea at the beginning, but as a business grows and more users start using the data warehouse, in-house data pipelines quickly become a bottleneck. Keeping up with API changes, application additions or replacements, and the analytics and reporting demands of new users is very time-consuming. Building a data pipeline is often a highly underappreciated activity and is best offloaded to specialized software. Leveraging a cloud data pipeline like Daton can help you ETL/replicate your data from any data source to Snowflake or Redshift without having to write a single line of code.
Snowflake vs Redshift: Maintenance
Snowflake is a fully managed cloud data warehouse, so the maintenance of a Snowflake data warehouse is largely limited to optimizing it for cost.
Using a Snowflake data warehouse is like using Uber, while using Redshift is like leasing a car. With Uber, you ride when you need to and pay for that usage, while leasing a car involves more overhead: storage, repair, maintenance, and so on. With Snowflake, you don’t have to worry about maintaining any component of the stack, hardware or software. Amazon Redshift, on the other hand, requires active management to achieve optimal performance.
The chores of vacuuming and analyzing tables are all too familiar to anyone who has used Redshift. Amazon has made progress on this front by automating these activities. Still, it is safe to say that the effort required to operate a Snowflake environment is much lower than that required for a Redshift environment.
Partitioning of data plays a significant role in determining the performance and cost of a data warehouse solution. Snowflake manages partitioning automatically, whereas in Redshift an administrator has to manage how the data is partitioned and distributed to optimize cost and performance.
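Why does partitioning matter for both cost and speed? Because a query engine can use per-partition metadata (for example, the minimum and maximum values of a column) to skip partitions that cannot contain matching rows, scanning far less data. Below is a deliberately simplified sketch of this pruning idea; real engines such as Snowflake track much richer per-partition statistics than a single min/max pair.

```python
# Simplified partition pruning: keep min/max metadata per partition and
# skip any partition whose range cannot contain the filter value.
# (Real warehouses track much richer per-partition statistics.)

partitions = [
    {"id": 0, "min_date": "2024-01-01", "max_date": "2024-03-31"},
    {"id": 1, "min_date": "2024-04-01", "max_date": "2024-06-30"},
    {"id": 2, "min_date": "2024-07-01", "max_date": "2024-09-30"},
]

def prune(parts, wanted_date):
    """Return only partitions whose [min, max] range could hold wanted_date.
    ISO date strings compare correctly as plain strings."""
    return [p for p in parts if p["min_date"] <= wanted_date <= p["max_date"]]

hits = prune(partitions, "2024-05-15")
print([p["id"] for p in hits])  # [1] -- two of three partitions skipped
```

When partitions line up poorly with query filters (for example, data loaded in random order), pruning skips little and every query scans everything, which is exactly the tuning burden an administrator carries on a system without automatic partition management.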
Scaling and resizing also add to the administrative and maintenance overhead of a data warehouse. Snowflake handles these operations automatically, based on rules that can be defined in advance, without incurring any downtime. With Redshift, however, resizing and scaling operations can incur downtime and involve a higher degree of personnel involvement.
Snowflake vs Redshift: Security
The security features of Snowflake and Redshift are comparable. The features offered by each solution may vary depending on the tier of service you purchase. Although security is paramount for any business, it is essential to understand that building a comprehensive, air-tight data warehouse with a 360-degree security blanket comes at a cost. So let your requirements drive your decisions with respect to security.
Both services offer:
- Encryption at rest with support for all major encryption protocols
- Role-Based Access Control (RBAC)
- Account/User level authentication
- Object security via RBAC and discretionary access controls
- SOC2 Type II, HIPAA, and PCI DSS compliance
  - Make sure to check which Snowflake edition offers the compliance certifications you need
- IP Whitelisting and blacklisting
- SSL Connections
- Encryption of data-in-transit
Snowflake vs Redshift – What is right for you?
You are already on AWS and leverage many of its services, and you want the data warehouse tightly coupled with your existing environment – Redshift may be a better fit.
You are just starting your data warehouse practice and are unsure about your requirements – Snowflake may be a better fit for you.
You have few users using the warehouse and may not need frequent data ingestion – Snowflake may be a better fit as it may cost less.
You use the data warehouse sparsely – depending on how you define sparse usage, Snowflake might be worth a try.
You don’t have any resources to manage or tune the data warehouse – Snowflake may be a better fit, as it requires almost no maintenance.
You have semi-structured data – Snowflake may be a better fit.
You have massive volumes of data and want a single service to process it – Snowflake may be a better fit, as it would likely cost less. With Redshift, you may have to add Redshift Spectrum to keep costs down.
You want separation of ingesting and read workloads – Snowflake makes it possible with its independently elastic storage and compute architecture.
You want to share data with your processing partners – Snowflake is a better fit with its unique data sharing capabilities.
Decades of data warehousing expertise on your side
No matter which cloud data warehouse you choose, Daton helps you effortlessly load your data and keep it in sync with your sources. Daton is a modern ETL platform built from the ground up to work perfectly with cloud data warehouses, and it is the cheapest data pipeline available in the market. It helps companies of any size leverage cloud data warehouses, and our in-house experts are knowledgeable about data warehouse nuances and ready to help.
Got questions about which data warehouse to choose? Contact us, and we will help you get the right data warehouse up and running.