Snowflake, Snowflake Architecture, and Key Features
Snowflake is a cloud-based data warehouse created in 2012 by three data warehousing experts who were formerly at Oracle Corporation. Over the last eight years, Snowflake Computing, the vendor behind the Snowflake Cloud Data Warehouse product, has raised over $400 million and acquired thousands of customers. One might wonder if there is a need for another data warehouse vendor in an already crowded field comprising traditional data warehousing technologies like Oracle, Teradata, and SQL Server, and cloud data warehouses like Amazon Redshift and Google BigQuery. The answer lies in the disruption caused by cloud technologies and the opportunities the cloud has afforded new technology companies. Public clouds have enabled startups to shed past baggage, learn from the past, challenge the status quo, and take a fresh look at the opportunities the cloud provides to create a novel data warehouse product. In this article, we introduce Snowflake and touch upon the core technology components that make up this modern data warehouse, built entirely in the cloud and for consumers of cloud technologies.
You can register for a $400 free trial of Snowflake within minutes. This credit is enough to store a terabyte of data and run a small data warehouse environment for a few days.
What is Snowflake Architecture?
Before we jump into the architecture of Snowflake, it is worthwhile to discuss the concept of clustering and the popular clustering techniques.
Clustering Architectures: Shared Nothing and Shared Disk
The demands on applications to be online and available at all times are increasing daily. Meeting these expectations, however, places a substantial operational burden on the underlying computing infrastructure. Loss of functionality, an under-performing technology stack, or unavailable systems can be a death knell for businesses whose revenue models depend on the constant availability and performance of their technology stack. Downtime can result from planned events, like patching or upgrades, or unplanned events, like hardware failures or natural disasters. As companies increasingly become global organizations, they need systems that operate 24x7.
Clustering is the default go-to methodology for increasing the availability and performance of computing infrastructure. Clustering, simply put, is the deployment of multiple processors or independent systems to tackle a problem faster and more reliably than a single processor, while appearing as a single unit to the user issuing the command. However, the devil is always in the details.
What is the purpose of clustering?
Clustering is generally the go-to option to provide enhanced scalability and availability of the applications. Clusters improve scalability by providing options to supplement more computing power to the application infrastructure when required. Clusters improve availability as they ensure the availability of processing power despite the failure of one or more processing units.
Well-designed cluster manager software handles these topology changes seamlessly, making them invisible to the end user. Availability, usually measured in multiple 9s, is typically the primary goal of any clustering exercise. As mentioned earlier, however, clusters also allow additional computing power to be added when required to meet the demands of application processing.
What are the different types of clustering?
There are two predominant approaches to clustering: shared-disk and shared-nothing architectures.
Shared-Disk Architecture
In this setup, all computing nodes share the same disk or storage device. Every computing node (processor) has its own private memory; however, all processors can access all disks. Since all nodes have access to the same data, cluster control software is required to monitor and manage the processing of data, so that all nodes see a consistent copy of the data as it undergoes inserts, updates, or deletes. Attempts by two (or more) nodes to concurrently update the same data must be forbidden.
Enforcing these management criteria degrades the performance and scalability of shared-disk systems. Typically, a shared-disk architecture is well-suited to large-scale processing demanding ACID compliance; Oracle Real Application Clusters is one such example of a shared-disk architecture. Shared disk is most feasible for applications and services requiring only limited shared data access, as well as applications or workloads that are difficult to partition. Applications that undergo frequent updates are usually better off in a shared-nothing architecture, because the shared-disk lock management controller can become a bottleneck.
Shared-Nothing Architecture
In a shared-nothing setup, each computing node has its own private memory and its own storage or disk capacity, neither of which is shared. Networking interconnects provide communication between these nodes. When a processing request comes in, a router directs it to the appropriate computing node for fulfilment; business rules are generally applied at this routing layer to distribute traffic efficiently across nodes. In a shared-nothing setup, when a computing node fails, its processing rights are transferred to another node in the cluster.
This transfer of ownership ensures no disruption to the processing of user requests. A shared-nothing architecture offers a high degree of availability and scalability to the application. Modern web-scale technology companies like Google, which pioneered the implementation of shared-nothing architectures, run geographically distributed shared-nothing clusters comprising thousands of computing nodes. This is why a shared-nothing clustering architecture is the ideal choice for a read-heavy analytical data processing system like a data warehouse.
Shared-Disk vs Shared-Nothing – A quick comparison
| Shared-Disk | Shared-Nothing |
| --- | --- |
| Expensive hardware with redundancy to handle component failure | Typically built on commodity hardware |
| High availability | Node availability is low, but system availability is high |
| Relatively low scalability | High scalability |
| Preferred for OLTP systems that require ACID compliance | Preferred in environments with high read/write rates |
| Data is partitioned and striped, but within the storage array | Data may be partitioned and distributed across the cluster |
Back to Snowflake Architecture
Snowflake relies on standard computing infrastructure, i.e. the virtual machines available to anyone in a public cloud environment. In AWS, this is EC2; in GCP, it is Compute Engine. Virtual Warehouses form a critical component of the Snowflake architecture. These virtual warehouses, by design, can process massive volumes of data with a high degree of efficiency and performance. When an incoming query is detected, computing power becomes available immediately to process the request. As in other database technologies, intelligent caching ensures optimal utilization of resources and reduces the interaction between compute and storage systems. Snowflake can deploy multiple virtual warehouses to process requests while simultaneously maintaining transactional integrity, keeping the system ACID compliant.
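As an illustration, a virtual warehouse is created and sized with a few lines of SQL; the warehouse name and settings below are placeholders:

```sql
-- Create an extra-small virtual warehouse that suspends itself
-- after 60 seconds of inactivity and resumes on the next query.
CREATE WAREHOUSE IF NOT EXISTS reporting_wh
  WITH WAREHOUSE_SIZE = 'XSMALL'
       AUTO_SUSPEND = 60
       AUTO_RESUME = TRUE
       INITIALLY_SUSPENDED = TRUE;

-- Point the current session at the new warehouse.
USE WAREHOUSE reporting_wh;
```

AUTO_SUSPEND and AUTO_RESUME are what keep an idle warehouse from accruing compute charges between queries.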
Snowflake relies on the scalable cloud blob storage available in public clouds like AWS, Azure, and GCP. Relying on massively distributed storage systems enables Snowflake to provide the high degree of performance, reliability, availability, capacity, and scalability required by the most demanding data warehousing workloads. The storage layer of Snowflake is architected to scale independently of the compute layer. This design choice works out well for the consumer in terms of both performance and cost. The storage layer holds the data, tables, and query results for Snowflake.
By segregating compute and storage, Snowflake can fulfil and scale read requests and write requests without having to prioritize one over the other. This segregation is one of the unique features of Snowflake, made possible by its ground-up redesign of the data warehouse stack. Storage management is handled entirely by Snowflake, leaving nothing to the end user. As data loads into Snowflake, algorithms take over to process and partition the incoming data and create metadata. This metadata enables efficient query processing down the line. Columnar compression applied to these partitions optimizes the utilization of space and improves query performance. The data is also encrypted to meet the highest standards of security required by enterprise companies.
The services layer of Snowflake is where all the intelligent action happens. This layer performs functions such as authenticating users, managing the cluster, executing and optimizing queries, security, encryption, and orchestrating transaction execution. It runs on compute nodes that are stateless and span the entire data center. Intelligent use of metadata distributed across the cluster of computing nodes maintains the global state of transactions and the system.
When a query is issued, the services layer parses and compiles it, determines which partitions hold the data of interest, and flags those partitions for scanning. One would expect this metadata processing to take up sizable computing power, and they wouldn't be wrong to think so. However, by design, metadata processing happens on a separate cluster of machines, which reduces the load on the compute resources actually processing the user's data.
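For example, a query with a selective filter lets the services layer skip any partition whose metadata (such as per-column minimum and maximum values) shows it cannot contain matching rows; the database, table, and column names below are hypothetical:

```sql
-- Only partitions whose metadata shows order_date values overlapping
-- March 2020 are flagged for scanning; the rest are pruned before
-- any compute resources touch the data.
SELECT customer_id,
       SUM(order_total) AS monthly_spend
FROM   sales.public.orders
WHERE  order_date BETWEEN '2020-03-01' AND '2020-03-31'
GROUP BY customer_id;
```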
Snowflake’s Multi-Cluster Shared-Data Architecture
Snowflake removes the management constraints typical of conventional data platforms. Snowflake is a cloud-native data warehousing platform. The system design offers a high degree of performance while eliminating administration overhead. The database is fully managed and scales automatically based on the demands of the workload. Built-in performance tuning, infrastructure management, and optimization capabilities provide businesses with peace of mind. All they need to do is bring their data and leave its management to Snowflake.
The Snowflake architecture is fully distributed, spans multiple availability zones and regions, and is highly fault-tolerant to hardware failures. Users of Snowflake rarely notice the impact of any failure in the underlying hardware.
Security is one of the hallmarks of the Snowflake architecture. Data is encrypted both in transit and at rest. Snowflake supports multiple authentication mechanisms, including two-factor authentication and federated authentication with support for SSO. It also provides role-based access control and capabilities to restrict access based on pre-defined criteria. Snowflake holds a host of certifications, including HIPAA compliance and SOC 2 Type II. Refer to the Snowflake security documentation for more details.
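As a sketch of role-based access control in Snowflake SQL, a read-only role can be defined and granted to a user; the role, database, schema, and user names here are placeholders:

```sql
-- Create a role and grant it read-only access to one schema.
CREATE ROLE IF NOT EXISTS analyst;
GRANT USAGE ON DATABASE sales TO ROLE analyst;
GRANT USAGE ON SCHEMA sales.public TO ROLE analyst;
GRANT SELECT ON ALL TABLES IN SCHEMA sales.public TO ROLE analyst;

-- Assign the role to a user; the user now sees only what the role allows.
GRANT ROLE analyst TO USER jane_doe;
```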
Sharing and Collaboration
Snowflake offers a unique feature that lets data owners share their data with partners or other consumers without creating a new copy of the data. The consumer of the data pays only for processing, as there is no data movement involved and none of their storage is utilized. Avoid the hassles of FTP or email by using the native sharing features provided by Snowflake, which you can invoke via native SQL.
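As a sketch, the provider side of such a share might look like the following; the database, schema, table, and consumer account identifiers are placeholders:

```sql
-- Create a share and expose one table through it.
CREATE SHARE sales_share;
GRANT USAGE ON DATABASE sales TO SHARE sales_share;
GRANT USAGE ON SCHEMA sales.public TO SHARE sales_share;
GRANT SELECT ON TABLE sales.public.orders TO SHARE sales_share;

-- Allow a consumer account to attach the share; no data is copied.
ALTER SHARE sales_share ADD ACCOUNTS = partner_account;
```

The consumer then creates a database from the share and queries it with their own compute, which is why only processing is billed to them.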
Snowflake is the only fully managed data warehouse that is available in multiple clouds while retaining the same user experience. Snowflake meets its users where they are comfortable and by doing so, reduces the need to move data back and forth from their cloud environment to Snowflake over the internet. Snowflake is available on Amazon Web Services, Google Cloud Platform and Microsoft Azure.
Performance and Scalability
Snowflake is well known for its performance capabilities. By enabling compute and storage to scale separately, Snowflake eliminated one of the biggest bottlenecks of traditional database technologies while preserving everything good about traditional RDBMS technologies. Users can start by specifying a cluster size for initial deployment and scale as needed, even while the system is up and running. Scaling operations are handled by Snowflake, transparently to users.
Snowflake offers a simplified pricing experience. A true pay-per-use model supports billing on a per-second basis: users pay only for the storage they use and the computing power deployed to process their requests. There are no upfront costs or extensive planning needed to get started with a data warehousing initiative. Clusters scale up to process heavy workloads and scale back down to the pre-defined size automatically, and users are billed for the expanded capacity only for the duration of use.
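Resizing a warehouse, or letting a multi-cluster warehouse expand and contract between predefined bounds, is a single statement; the warehouse name below is a placeholder, and note that multi-cluster settings require the Enterprise edition:

```sql
-- Resize an existing warehouse, even while it is running.
ALTER WAREHOUSE reporting_wh SET WAREHOUSE_SIZE = 'LARGE';

-- Let the warehouse scale out to at most three clusters under heavy
-- concurrency and shrink back to one cluster when demand drops.
ALTER WAREHOUSE reporting_wh SET
  MIN_CLUSTER_COUNT = 1
  MAX_CLUSTER_COUNT = 3;
```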
Where does Snowflake run?
Snowflake on Amazon Web Services (AWS)
Snowflake was initially released on Amazon Web Services (AWS) and is a cloud-native data warehousing platform for loading, analyzing, and reporting on large amounts of data. Conventional on-premises technologies like Oracle and Teradata are expensive for many small to mid-sized businesses; procuring and installing hardware and the expense of installing and maintaining software are just a couple of the reasons. Snowflake, on the other hand, is deployed in the cloud and becomes available to users within minutes. Snowflake's pricing model delivers incredible flexibility for organizations of all sizes to adopt a data warehouse as a unified data store for reporting and analytics. AWS users can spin up their Snowflake environment directly from the AWS Marketplace.
Snowflake on Azure
Snowflake later launched on the Microsoft Azure cloud platform. This launch gave companies already on Azure the choice between Azure SQL Data Warehouse and Snowflake as their data warehousing technology. Read this blog to get started with Snowflake on Azure. Customers who run Snowflake on Azure can also benefit from one of the industry-leading business intelligence products, Microsoft Power BI. By co-locating the data warehouse and Power BI, customers avoid the latency typically involved in moving data from the data warehouse to the cloud environment hosting the business intelligence software. Follow this link to see how easy it is to connect Power BI to Snowflake.
Snowflake on Google Cloud
In June 2019, Google announced a strategic partnership with Snowflake, offering Google Cloud Platform customers the option to leverage Snowflake as a data warehousing technology in addition to Google BigQuery. Looker and Snowflake are one of the most prominent technology combinations in the market currently. With the acquisition of Looker and the availability of Snowflake, GCP customers can now benefit from the shared synergy and co-existence of these products within the same cloud environment.
Saras Analytics is an official Snowflake ETL Partner. Our product, Daton, seamlessly replicates data from various data sources into Snowflake without you having to write a single line of code. With 100+ connectors to different data sources, Daton is the fastest and easiest way to replicate data to Snowflake.
Looking for Snowflake Alternatives?