Google BigQuery – Architecture Deep Dive and Key Features

Posted By: administrator
Posted On: 29 Jan, 2020
Last Updated On: 14 Apr, 2020

Before we jump into what Google BigQuery is, it is worthwhile to understand the origins of the technology that powers it. The dot-com boom gave rise to a host of web-scale companies like Google, Amazon, Facebook, Twitter, YouTube, and many more. These companies generate significantly more data than traditional Fortune 500 enterprises. Every user click, every search, every social media post, and every press of the like button adds up to billions of rows of data every single day.

Traditional relational database technologies were not designed to handle the volume and variety of data generated by these web-scale companies. New classes of data storage and retrieval technologies had to be created to meet users' growing performance demands. Imagine a Google search query taking several seconds to return results; Google's entire search-based revenue model would be in jeopardy, because users are generally unwilling to wait long to see the results of their actions.

Google BigQuery

To address the issues of petabyte-scale data storage, networking, and sub-second query response times, Google engineers invented new technologies, initially for internal use, that are code-named Colossus, Jupiter, and Dremel. The externalization of these technologies is called Google BigQuery. 

Dremel: Dremel is the query execution engine that powers BigQuery. It is a highly scalable system designed to execute queries on petabyte-scale datasets. Dremel uses a combination of columnar data layouts and a tree architecture to process incoming query requests. This combination enables Dremel to process trillions of rows in just seconds. Unlike many database architectures, Dremel can scale its compute nodes independently to meet the needs of even the heaviest queries.
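To make the idea of columnar layouts combined with a tree architecture more concrete, here is a minimal Python sketch. It is not Dremel's actual implementation; the names `leaf_scan` and `tree_sum`, the shard layout, and the fan-out of 2 are all assumptions made for illustration. Each leaf reads only the column it needs from its shard, and partial results are merged level by level until a single answer reaches the root.

```python
from typing import Dict, List

# Hypothetical columnar shard: every column is stored as its own list.
Table = Dict[str, List[int]]


def leaf_scan(shard: Table, column: str) -> int:
    """A leaf reads only the requested column of its shard and returns a partial sum."""
    return sum(shard[column])


def tree_sum(shards: List[Table], column: str, fanout: int = 2) -> int:
    """Merge partial sums level by level, the way intermediate tree nodes might."""
    if fanout < 2:
        raise ValueError("fanout must be at least 2")
    partials = [leaf_scan(shard, column) for shard in shards]
    while len(partials) > 1:
        # Each intermediate node combines the results of up to `fanout` children.
        partials = [sum(partials[i:i + fanout]) for i in range(0, len(partials), fanout)]
    return partials[0] if partials else 0


# Illustrative data only: three shards of a made-up "clicks" table.
shards = [
    {"clicks": [1, 2, 3], "user_id": [10, 11, 12]},
    {"clicks": [4, 5], "user_id": [13, 14]},
    {"clicks": [6], "user_id": [15]},
]
total_clicks = tree_sum(shards, "clicks")
print(total_clicks)
```

Note that the `user_id` column is never read, which is the point of a columnar layout: a query pays only for the columns it references.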

Dremel is also the core technology behind features of many Google services, such as Gmail and YouTube, and is used extensively by thousands of users inside Google. Dremel relies on a cluster of computing resources that execute parallel jobs on a massive scale. Based on the incoming query, Dremel dynamically determines the amount of compute needed to fulfill the request, pulls those resources from a shared pool, and processes the request. This pooling happens under the covers and is fully transparent to the user issuing the query. From the user's standpoint, they fire a query and get results in a predictable amount of time, every time.

Colossus: Colossus is the distributed file system used by Google for many of its products. In every Google data center, Google runs a cluster of storage disks that provides storage for its various services. Colossus protects the data stored on those disks against loss by choosing appropriate replication and disaster recovery strategies.

Jupiter Network: The Jupiter network is the bridge between Colossus storage and the Dremel execution engine. The networking in Google’s data centers offers very high levels of bi-directional bandwidth, which allows large volumes of data to move between Dremel and Colossus.

Google combined these technologies and exposed them as an external service called BigQuery on the Google Cloud Platform. BigQuery is a cloud-native, fully-managed data warehouse. With its decoupled compute and storage architecture, it offers attractive options for large and small companies alike. Let’s drill into some of the aspects of BigQuery that make it a compelling candidate for your data warehousing needs.

Manageability: As mentioned earlier in the post, Google BigQuery is fully-managed. Other services claim to offer this capability, but with BigQuery the management of the service is handled entirely by Google. Patching, upgrades, storage management, and compute allocation are all handled by the service, leaving nothing on the user's plate. BigQuery does not require an administrator to manage it. By offering serverless execution, BigQuery abstracts away traditionally complex activities like server/VM management, server/VM sizing, memory management, and many more.

Scalability: BigQuery relies on massively parallel computing and a highly scalable and secure storage engine to offer users true scalability and consistent performance. A complex software stack manages the entire infrastructure, which runs to thousands of machines per region.

Storage: BigQuery allows users to load data in a variety of formats like Avro, JSON, CSV, and more. On load, BigQuery converts the data into an internal representation based on columnar storage. Columnar storage has many benefits, including efficient use of storage and the ability to scan data much faster than a traditional row-based format. BigQuery transparently optimizes the files loaded into the storage layer to ensure optimal query response times. From a user’s perspective, traditional backup, recovery, and cloning operations are not something users have to manage in Google BigQuery.
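The row-to-column conversion described above can be illustrated with a small Python sketch. This is only a toy model of the idea, not BigQuery's internal format; the function name `rows_to_columns` and the sample records are hypothetical.

```python
from typing import Any, Dict, List


def rows_to_columns(rows: List[Dict[str, Any]]) -> Dict[str, List[Any]]:
    """Pivot row-oriented records into one list per column (missing fields become None)."""
    names: List[str] = []
    for row in rows:
        for name in row:
            if name not in names:
                names.append(name)
    return {name: [row.get(name) for row in rows] for name in names}


# Illustrative records, as they might arrive from a JSON or CSV load.
rows = [
    {"user": "a", "country": "US", "clicks": 3},
    {"user": "b", "country": "IN", "clicks": 5},
    {"user": "c", "country": "US", "clicks": 2},
]
columns = rows_to_columns(rows)

# A query such as SUM(clicks) touches one column instead of every field of every row.
values_read_columnar = len(columns["clicks"])
values_read_row_based = sum(len(row) for row in rows)
print(values_read_columnar, values_read_row_based)
```

In this toy example the columnar scan reads a third of the values the row-based scan would, and the gap widens as tables gain more columns.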

The design of Google BigQuery, and more specifically the separation of compute and storage, enables Google to offer the service under flexible pricing models. When not in use, the on-demand Google BigQuery service charges only for the storage used; there is no charge for compute until a query is issued. This is a significant departure from traditional data warehouse models, which charge customers for compute resources irrespective of whether they are in use or idle.

Data Ingestion: Google BigQuery supports both streaming and batch data ingestion. Google BigQuery doesn’t charge for batch data ingestion, while there is a separate charge for streaming data ingestion. The streaming capabilities in Google BigQuery allow users to stream millions of rows of data every minute while eliminating the complexity of infrastructure management.
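As a rough sketch of what streaming ingestion looks like from the client side, the Python below groups rows into request bodies shaped like the ones the BigQuery REST method `tabledata.insertAll` accepts (a `rows` list whose entries carry an `insertId` and a `json` payload). The helper name `insert_all_bodies`, the sample events, and the default batch size of 500 are assumptions for illustration; the code only builds the bodies and does not send anything.

```python
import json
import uuid
from typing import Any, Dict, Iterator, List


def insert_all_bodies(rows: List[Dict[str, Any]], batch_size: int = 500) -> Iterator[Dict[str, Any]]:
    """Yield one insertAll-style request body per batch of rows."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(rows), batch_size):
        batch = rows[start:start + batch_size]
        yield {
            "rows": [
                # insertId lets the service de-duplicate retried rows on a best-effort basis.
                {"insertId": str(uuid.uuid4()), "json": row}
                for row in batch
            ]
        }


# Hypothetical click events; a real pipeline would read these from a queue or log.
events = [{"user": f"u{i}", "clicks": i} for i in range(5)]
bodies = list(insert_all_bodies(events, batch_size=2))
first_payload = json.dumps(bodies[0])
print(len(bodies), first_payload)
```

In practice most users would let a client library or an ingestion tool handle the batching and retries; the sketch only shows the shape of the data being streamed.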

Pricing: BigQuery offers a flat-rate pricing model as well as an on-demand pricing model. The choice between them can be made based on the scale of operation. Because compute and storage are separated, customers with infrequent query demands, like a mid-sized company or a department, can benefit significantly: they only pay for the resources used for query processing. Larger customers can pay for dedicated resources. On-demand querying doesn’t offer the same predictability as a flat-rate model, but it still makes sense for many use cases. Click here for a more detailed article on the topic.
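A quick way to reason about the choice between the two models is a break-even calculation. The Python sketch below is only an illustration: the rates `PRICE_PER_TB` and `FLAT_MONTHLY_FEE` are placeholder numbers, not quoted prices, so substitute the current figures from Google's pricing page before drawing conclusions.

```python
def on_demand_cost(tb_scanned: float, price_per_tb: float) -> float:
    """Monthly on-demand cost: you pay only for the data your queries scan."""
    return tb_scanned * price_per_tb


def break_even_tb(flat_monthly_fee: float, price_per_tb: float) -> float:
    """Monthly scan volume at which flat-rate and on-demand cost the same."""
    if price_per_tb <= 0:
        raise ValueError("price_per_tb must be positive")
    return flat_monthly_fee / price_per_tb


# Placeholder rates for illustration only (assumptions, not official prices).
PRICE_PER_TB = 5.0
FLAT_MONTHLY_FEE = 10_000.0


def cheaper_model(tb_scanned: float) -> str:
    """Pick the cheaper model for a given monthly scan volume under the placeholder rates."""
    if on_demand_cost(tb_scanned, PRICE_PER_TB) <= FLAT_MONTHLY_FEE:
        return "on-demand"
    return "flat-rate"


threshold = break_even_tb(FLAT_MONTHLY_FEE, PRICE_PER_TB)
print(threshold, cheaper_model(100), cheaper_model(5000))
```

Under these placeholder rates, a department scanning 100 TB a month stays well below the threshold and is better served by on-demand pricing, while a team scanning 5,000 TB a month would favor flat-rate.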

Security: Google BigQuery supports a couple of different authentication models. OAuth-based and service-account-based models allow access to be granted to Google BigQuery resources. Users, groups, or service accounts can be granted access at various levels. The granularity of access control is limited to the dataset level, and any tables or views under a dataset automatically inherit its permissions. Read more about Google BigQuery’s IAM policy here. New Data Loss Prevention capabilities extend the security features of BigQuery by offering data redaction, masking, and sensitive data discovery to Google BigQuery users.
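The dataset-level inheritance described above can be pictured with a small in-memory model. This Python sketch is a simplification for illustration and does not call any BigQuery API; the `dataset_acl` structure, the member names, and the `can_read` helper are hypothetical, and the role names are used loosely.

```python
from typing import Dict, Set

# Hypothetical access list: dataset -> role -> members granted that role.
dataset_acl: Dict[str, Dict[str, Set[str]]] = {
    "sales": {
        "READER": {"analyst@example.com"},
        "WRITER": {"etl-bot@example.com"},
    },
}

# In this simplified model, every role listed here implies read access.
ROLES_WITH_READ = ("READER", "WRITER", "OWNER")


def can_read(member: str, table_ref: str) -> bool:
    """Check read access for 'dataset.table'; the table inherits the dataset's grants."""
    dataset, _, _table = table_ref.partition(".")
    roles = dataset_acl.get(dataset, {})
    return any(member in roles.get(role, set()) for role in ROLES_WITH_READ)


print(can_read("analyst@example.com", "sales.orders"))
print(can_read("analyst@example.com", "hr.salaries"))
```

Because the grant lives on the dataset, there is no way in this model to give the analyst `sales.orders` without also giving them every other table in `sales`; separating sensitive tables into their own dataset is the usual workaround.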

Usability: Google BigQuery offers access patterns expected of a data warehouse. It supports CLI, SDK, ODBC, JDBC, REST API, and a Google BigQuery Console that users can log into and fire queries. All these access patterns invoke REST APIs under the covers and return the required data to the user. Commonly used GUI tools, like DataGrip, can be used to connect to the Google BigQuery data warehouse and explore data in Google BigQuery. 
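Since every access pattern ends up as a REST call, it can help to see what such a call looks like. The Python sketch below builds (but does not send) a request for the `jobs.query` method of the BigQuery REST API using only the standard library. The project ID, the SQL text, and the token string are placeholders, and the helper name `build_query_request` is our own; obtaining a real OAuth access token is outside the scope of this sketch.

```python
import json
import urllib.request


def build_query_request(project_id: str, sql: str, access_token: str) -> urllib.request.Request:
    """Build a POST request for the BigQuery jobs.query REST method without sending it."""
    url = f"https://bigquery.googleapis.com/bigquery/v2/projects/{project_id}/queries"
    body = json.dumps({"query": sql, "useLegacySql": False}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {access_token}",
            "Content-Type": "application/json",
        },
    )


# Placeholder values for illustration; none of these refer to a real project or token.
request = build_query_request("my-project", "SELECT 1", "ACCESS_TOKEN_PLACEHOLDER")
print(request.get_method(), request.full_url)
```

Passing the request to `urllib.request.urlopen` with a valid token would execute the query; the CLI, SDKs, and drivers wrap this same kind of call with authentication, retries, and result paging.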

Data Transfer: Google BigQuery has native capabilities to load data from some Google services, such as Google Analytics and AdWords, among others. However, for a larger consolidation effort, a data replication product like Daton can accelerate data consolidation into Google BigQuery.

Google BigQuery Console

Google Cloud offers a $300 free trial for you to kick the tires on their services. We encourage you to sign up for the free trial using this link to get an additional $50 in free credits to try out BigQuery.

Sign up for a free trial of Daton for 14 days and get started with your BigQuery proof of concept in a day!
