Before deciding whether Amazon Redshift suits your data needs, it is essential to understand what it is. An in-depth understanding of the pros and cons of Amazon Redshift will help you make a sound decision.
What is Amazon Redshift
Amazon Web Services (AWS) was the first public cloud provider to offer a cloud-based, petabyte-scale data-warehousing service. The service is called Amazon Redshift and is one of the most popular cloud data warehouses.
Amazon claims thousands of businesses as its clients. Still, rivalry in this field is growing, with Google BigQuery, Snowflake, and Oracle Autonomous Data Warehouse eyeing a share of the growing cloud data warehouse market.
Amazon Redshift has been around since 2013 and has undergone several enhancements. Amazon Redshift Spectrum, Amazon Athena, and the omnipresent, massively scalable data storage solution, Amazon S3, complement Amazon Redshift and offer all the technologies needed to build a data warehouse or data lake on an enterprise scale. Let us dig a little deeper to understand the pros and cons of Amazon Redshift in more detail.
Pros & Cons of Amazon Redshift
Pros of Amazon Redshift
Widely Adopted
Amazon Redshift has a thriving and robust customer base as one of the first cloud-native data warehousing technologies. A healthy ecosystem of knowledgeable resources is available to support organizations in extracting value from their data warehousing initiatives.
Ease of Administration
Amazon Redshift offers an assortment of tools to reduce the administrative burden typically involved in running a database. Tools are available to create clusters easily, automate database backups, and scale the data warehouse up and down. In the past, all of these activities required database administrators. With the tools available in Amazon Redshift, users can click a few buttons or call REST APIs to carry out these tasks.
Ideal for Data Lakes
Amazon Redshift Spectrum extends the capability of Redshift by allowing compute and storage to scale independently of each other and by issuing queries directly against data stored in S3 buckets.
Ease of Querying
Amazon Redshift's query language is similar to that of the popular PostgreSQL. Anyone familiar with PostgreSQL can use their SQL skills to start working with Redshift clusters. JDBC and ODBC support allows developers to connect to their Redshift clusters using the database query tool of their liking. The Redshift console also allows users to issue queries and work on the database. However, power users may prefer to use a tool of their choice. Most business intelligence tools in the market today support Amazon Redshift.
Columnar Storage
When rows are inserted into a relational database, they are typically stored in a row format. Although row formats are very efficient for write operations, they underperform in analytical read operations, which usually touch only a few columns across many rows. Columnar storage instead keeps the values of each column together. Because the values within a column share a data type and are often repetitive, a column-oriented compression approach can compress them, including missing values, more efficiently than a row-oriented one. By compressing the column data, the storage footprint on the disk can be significantly reduced. A query issued on columns can scan a smaller data footprint and transfer a lower volume of data over the network or I/O subsystem to the compute node for processing. This leads to a significant improvement in the performance of analytical query processing.
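To make the idea concrete, here is a minimal Python sketch of why grouping values by column helps compression. The sample rows and helper names are hypothetical, and run-length encoding is used only as a simple stand-in for column compression; it is an illustration of the principle, not a description of how Redshift stores data internally.

```python
from itertools import groupby

# Hypothetical sample rows: (order_id, country, status)
rows = [
    (1, "US", "shipped"),
    (2, "US", "shipped"),
    (3, "US", "pending"),
    (4, "DE", "pending"),
    (5, "DE", "pending"),
    (6, "DE", "shipped"),
]

def to_columns(rows):
    """Pivot row tuples into one list per column."""
    return [list(col) for col in zip(*rows)]

def run_length_encode(values):
    """Collapse consecutive repeats into (value, count) pairs."""
    return [(value, len(list(group))) for value, group in groupby(values)]

def run_length_decode(pairs):
    """Expand (value, count) pairs back into the original list."""
    return [value for value, count in pairs for _ in range(count)]

order_ids, countries, statuses = to_columns(rows)

# Six country values shrink to two pairs once they sit next to each other.
encoded_countries = run_length_encode(countries)
print(encoded_countries)  # [('US', 3), ('DE', 3)]
```

In the row layout the two country values are interleaved with order ids and statuses, so there are no runs to collapse; in the column layout they are adjacent and compress well.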
Performance
Amazon Redshift is an MPP (Massively Parallel Processing) database. Its efficient implementation of columnar storage algorithms and data partitioning techniques gives Amazon Redshift an edge in terms of performance.
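The general shape of massively parallel processing can be sketched in a few lines of Python. This is a toy model under stated assumptions: the slice count, the hash function, and all function names are hypothetical, and threads stand in for separate compute nodes. It shows the pattern of partitioning data, aggregating each partition independently, and merging the partial results, not Redshift's actual execution engine.

```python
import zlib
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

NUM_SLICES = 4  # hypothetical number of parallel workers

def slice_for(key, num_slices=NUM_SLICES):
    """Deterministically map a key to a slice (partition)."""
    return zlib.crc32(str(key).encode("utf-8")) % num_slices

def distribute(rows, num_slices=NUM_SLICES):
    """Send every row to the slice chosen by its customer_id."""
    slices = [[] for _ in range(num_slices)]
    for customer_id, amount in rows:
        slices[slice_for(customer_id, num_slices)].append((customer_id, amount))
    return slices

def partial_sum(slice_rows):
    """Aggregate one slice independently of all the others."""
    totals = defaultdict(int)
    for customer_id, amount in slice_rows:
        totals[customer_id] += amount
    return dict(totals)

def parallel_total_by_customer(rows):
    """Partition, aggregate each slice in parallel, then merge the results."""
    slices = distribute(rows)
    with ThreadPoolExecutor(max_workers=NUM_SLICES) as pool:
        partials = list(pool.map(partial_sum, slices))
    merged = {}
    for partial in partials:
        # A customer_id lives on exactly one slice, so keys never collide.
        merged.update(partial)
    return merged

sales = [(101, 20), (102, 5), (101, 30), (103, 7), (102, 15)]
totals = parallel_total_by_customer(sales)
print(sorted(totals.items()))  # [(101, 50), (102, 20), (103, 7)]
```

Because rows with the same key always land on the same slice, each worker can finish its share of the aggregation without talking to the others, which is what lets this style of processing scale out.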
Scalability
The ability to scale is one of the most important aspects of a database, and Amazon Redshift is no different. Scaling a Redshift cluster is simple compared to scaling an on-premises database. Internal complications involving hardware expansion, VM resizing, and data rebalancing amongst the nodes are entirely overseen by Amazon Redshift and hidden under a UI button or a REST API call.
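As a rough sketch of what "a REST API call" can look like in practice, the Python below wraps the `resize_cluster` operation of the AWS SDK for Python (boto3). The cluster name, node count, and helper function names are hypothetical, and the example assumes boto3 is installed and AWS credentials are configured; treat it as an illustration rather than a recommended operational procedure.

```python
def build_resize_request(cluster_id, number_of_nodes, classic=False):
    """Build the keyword arguments for a Redshift resize request.

    classic=False asks for an elastic resize rather than a classic one.
    """
    if number_of_nodes < 1:
        raise ValueError("number_of_nodes must be at least 1")
    return {
        "ClusterIdentifier": cluster_id,
        "NumberOfNodes": number_of_nodes,
        "Classic": classic,
    }

def resize_cluster(cluster_id, number_of_nodes):
    """Send the resize request to AWS (needs boto3 and valid credentials)."""
    import boto3  # third-party AWS SDK, imported lazily so the sketch loads without it

    client = boto3.client("redshift")
    return client.resize_cluster(**build_resize_request(cluster_id, number_of_nodes))

# Hypothetical cluster name and target size; nothing is sent to AWS here.
request = build_resize_request("example-cluster", 4)
print(request)
```

Calling `resize_cluster("example-cluster", 4)` would submit the request; the hardware changes and data rebalancing then happen on the service side.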
Security
Security is a significant roadblock in many companies’ adoption of cloud services. However, it is essential to realize that cloud services, when appropriately configured, can offer a much higher degree of protection than the security setups of many internal IT (Information Technology) teams. The scale of public clouds enables them to hire more resources and deploy them to monitor and secure the cloud environment 24x7x365.
Amazon Web Services is no different. Amazon Redshift security cannot be discussed in isolation: the security capabilities offered by Amazon Redshift sit on top of the security implementation at the cloud services layer. Robust identity and access management, role-based access control (RBAC), encryption in transit and at rest, and SSL connections are some of the security features in Redshift. You can read more about them here. Amazon Redshift is HIPAA, SOC 2 Type II, FedRAMP, and PCI certified.
Strong AWS Ecosystem
If you are considering Amazon Redshift as your data warehouse, you probably have some environments already running on AWS. As important as selecting competitive applications for your workloads is, it is also essential to factor in other aspects like community support, pricing and discounting, and skillset within the company.
Selecting a technology often has both strategic and tactical implications. These may not matter to smaller organizations. However, larger organizations with well-established teams must weigh these considerations before deciding on any software purchase, including a data warehouse. With a wide variety of services on offer in AWS, organizations can bundle their services to get better terms for the services they use.
Pricing
Many factors contribute to the purchase price of an Amazon Redshift cluster. Anyone considering Amazon Redshift as their data warehouse must understand these factors in detail to avoid future surprises. You can read a more in-depth article on Amazon Redshift pricing here. With a wide variety of pricing models and flexibility in terms of deployment, Amazon Redshift provides something for every company, regardless of size.
Cons and Limitations of Amazon Redshift
Amazon Redshift is a data warehousing system by design. The entire service is tuned and perfected for a specific workload: analytical data processing. Certain data types, such as XML and JSON, are only partially supported by Amazon Redshift, so working with data that is not in a supported format can be difficult. If you are interested in a database that does efficient transaction processing, AWS has several other services, such as Amazon Aurora, Amazon RDS, and DynamoDB, that you may want to consider.
Not a Multi-Cloud Solution
While the ecosystem plays a vital role in driving the choice of software, a lack of choice is seen as a mechanism by the software vendor to lock customers into their service offerings. Amazon Redshift, unlike Snowflake, is only available on AWS. If you are a user of Azure, GCP, or Oracle Cloud, then carefully evaluate solutions offered by those cloud providers before deciding to go with Amazon Redshift.
Amazon Redshift is Not 100% Managed
Although the tools provided by Amazon reduce the need for a full-time database administrator, they do not eliminate the need for one. Amazon Redshift is known to have issues with handling storage efficiently in an environment prone to frequent deletes. Maintaining sort order is also critical to achieving good performance. These aspects of the database are not well known to developers, and one would argue that they should not have to care. And they would be right.
Current improvements in database technology can eliminate the need for users to understand these database administration topics, letting the database manage itself to deliver optimum performance without ever needing a database administrator. Snowflake and Oracle Autonomous Data Warehouse have made massive strides in this regard. Amazon Redshift has already released a slew of features like automatic table sort, automatic vacuum delete, and automatic analyze, demonstrating progress on this front.
Concurrent Execution
Concurrent execution is a known challenge in MPP databases. In an environment where many users execute queries simultaneously, Redshift could run into performance problems. In addition, because compute and storage are not separated, read workloads can suffer when heavy write activity, such as a massive batch processing job, is running in the database.
A cluster resize disrupts service to the end user. Although the disruption is minimal, the lack of a seamless cluster resize capability can be considered a drawback in a market where competitors offer the ability to scale up and down without downtime. This minor disruption is tolerable for most businesses but remains an issue.
Choice of Keys Impacts Performance and Price
In the cloud world, performance = price.
Users must carefully design their strategies around distribution and sort keys while keeping an eye on future requirements. They should also regularly reassess the validity of their sort key and distribution key choices as more data gets ingested into the Amazon Redshift data warehouse. A sub-optimal design degrades system performance, which in turn causes user satisfaction issues and can increase the costs of the Redshift data warehouse. It is easy to increase the cluster size to deal with the problem, but that would increase your costs. A careful key strategy allows companies to get the most out of their Amazon Redshift investment before scaling up.
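The effect of a poor distribution key can be illustrated with a small Python simulation. Everything here is hypothetical: the table, the column names, the slice count, and the CRC-based hash are assumptions chosen for the sketch, and Redshift's real distribution logic is not being reproduced. The point is only that a key with few distinct values piles rows onto a few slices, leaving the rest of the cluster idle.

```python
import zlib

def slice_for(value, num_slices):
    """Deterministically map a key value to a slice."""
    return zlib.crc32(str(value).encode("utf-8")) % num_slices

def rows_per_slice(rows, key_column, num_slices=4):
    """Count how many rows each slice receives for a given distribution key."""
    counts = [0] * num_slices
    for row in rows:
        counts[slice_for(row[key_column], num_slices)] += 1
    return counts

def skew_ratio(counts):
    """Largest slice divided by the ideal even share; 1.0 is perfectly even."""
    return max(counts) / (sum(counts) / len(counts))

# Hypothetical orders table: 1,000 distinct order ids but only two countries.
orders = [
    {"order_id": i, "country": "US" if i % 10 else "DE"}
    for i in range(1000)
]

by_order_id = rows_per_slice(orders, "order_id")
by_country = rows_per_slice(orders, "country")

print("distributed by order_id:", by_order_id, "skew", round(skew_ratio(by_order_id), 2))
print("distributed by country: ", by_country, "skew", round(skew_ratio(by_country), 2))
```

With `country` as the key, at least 900 of the 1,000 rows end up on a single slice, so a query runs roughly as slowly as that one overloaded slice. Adding nodes would raise the bill without fixing the imbalance, which is why revisiting the key choice comes first.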
Master Node
A master node (called the leader node in AWS documentation) plays a critical role in the Redshift architecture by orchestrating the allocation and execution of queries and the aggregation of their results. All clients interact only with the master node; therefore, a non-redundant master node creates a single point of failure for the environment.
Not a Serverless Architecture
Amazon Redshift is part of the old guard when it comes to cloud data warehouses. Having been designed many years ago, Redshift has some limitations. A serverless architecture enables the vendor to do a higher degree of hardware optimization, which translates into lower prices for customers: the price per customer decreases when the same hardware is utilized by three customers instead of one. The old guard has its benefits too, having been around and innovating for a long time. These benefits sometimes outweigh the perceived drawbacks, and sometimes they do not.
Conclusion
The choice of a data warehouse depends on your use case, your budget, the current state of the business, and your plans to use the data warehouse. We do not believe there is an absolute right or wrong choice about technology selection. Feel free to contact us if you have questions about what data warehouse is a good fit for your business. Our data architects can guide you in making the right decision for your business.
At Saras Analytics, we passionately believe in the power of data and how organizations of all sizes can now benefit from the rapid innovations in cloud data warehousing technologies. Read our article on why we believe it is time for every company to acknowledge the advantages of a data warehouse in business and invest in data warehouses.
Our cloud-based data pipeline, Daton, provides a simple yet cost-effective way to replicate your data to Amazon Redshift. Daton has 100+ pre-built adapters for databases, SaaS applications, files, webhooks, marketing applications, and more. As a result, you can replicate your data to Amazon Redshift from any source in three simple steps, in a matter of minutes and without writing any code.
Are you ready to leverage the power of data with Daton?