It’s a good problem to have — your application is gaining traction, and the amount of data you have to analyze has exploded. Your queries start taking way too long, and you know that your data has become too large to be managed by a conventional database.
It’s at this point that you start looking for a way to keep your data organized and make it easily accessible for analytics and reporting — a data warehouse.
What is a Data Warehouse?
At its simplest, a data warehouse is a central repository of data, which very often originate from disparate sources. Businesses use data warehouses to analyze all of their data in one place and to make more informed decisions. Data can come into the data warehouse from various sources including relational databases, transactional systems, and other sources.
Data warehouses enable businesses to keep enormous amounts of data in one place since operational and transactional databases are typically not designed for historical reporting or complex queries across terabytes or petabytes of data.
How Do Data Warehouses Work?
Just like a relational database, data warehousing systems define a schema usually consisting of multiple tables which themselves contain columns of some data type (text, number, etc.) While data warehousing systems may look similar to relational databases on the surface, they differ drastically.
Relational database systems are optimized for row-based operations. The common query that selects one or more rows of data can be quite fast since rows are stored next to each other on disk. However, performing an aggregate query on a particular column for the entire table results in a scan of the entire table, which is slow.
Data warehouses take the opposite approach, where their columns are stored next to each other on disk. The benefit of this design is that you can quickly perform analytical queries on a column since the necessary data can be loaded from disk very quickly, without the overhead of the unneeded parts of the table.
Foreign keys and constraints is another area where data warehouses differ drastically from their relational cousins. Unique constraints for say, a social security number provide data integrity in a relational system, ensuring no two SSNs can be added to the same table. However, there are performance costs associated with this enforcement as the engine must validate this unique check whenever new data is added. Data warehouses relax or completely remove constraints like these. In a system which is managing petabytes of data across multiple servers, it is simply too costly to enforce or check constraints like this.
Pros and Cons of Amazon Redshift
Your data warehouse will be the foundation for your company’s applications for years to come, so choosing the right technology is crucial. You should make sure the tool you pick is capable of meeting growth and comprehensive requirements of the organization both today and in the future.
Amazon Redshift is a cloud-based, fully managed, petabyte-scale data warehouse service by Amazon Web Services (AWS). In this blog post, we’ll explore the pros and cons of Amazon Redshift to help you make a decision.
Amazon Redshift Pros
First, let’s look at some of the advantages of Amazon Redshift:
- High Performance — Redshift achieves high performance using massive parallelism, efficient data compression, query optimization, and distribution. Using its Massively Parallel Processing (MPP) architecture, Redshift can parallelize data loading, backup, and restore operations. Additionally, queries that you execute get distributed across multiple nodes. Redshift’s columnar storage database is optimized for massive and repetitive type of data, dramatically reduces the input/output (I/O) operations on disk, improving performance.
- Speed — When it comes to loading data and querying it for analytics and reporting, Redshift is extremely fast. MMP allows you to load data at blazing fast speeds.
- Scalability – Scalability is crucial for any data warehousing solution, and Redshift performs well in this arena too. It is horizontally scalable, meaning whenever you need to increase the storage or need it to run faster, you can add more nodes using AWS console or Cluster API, and it will upscale immediately.
- Transparent and Competitive Pricing – Redshift is considerably cheaper than alternatives or an on-premise solution. Redshift has two pricing models that give you the flexibility to categorize the expense as an operational expense or capital expense.
- SQL Interface – The Redshift Query Engine has the same interface as PostgreSQL, which means developers who are already familiar with SQL won’t have a steep learning curve to get going. Since Redshift uses SQL, it works with existing Postgres drivers, easily connecting to most business intelligence tools.
- Security — Redshift comes packed with security features, including various ways to handle access control, Virtual Private Cloud (VPC) for network isolation, data encryption etc... You can launch Redshift clusters inside your VPC so you can define security groups and restrict inbound or outbound access to your Redshift clusters.
- AWS Ecosystem – Many businesses are already running their infrastructure on AWS. As you’d expect, Redshift works very well with the rest of the AWS infrastructure tools. For example, when loading or dumping data on S3, Redshift uses MPP to move data very quickly.
Amazon Redshift Cons
Now that we’ve addressed the benefits of Redshift, let’s talk about some of the limitations and disadvantages.
- Limited Support for Parallel Upload — Redshift can quickly load data from Amazon S3, relational DyanmoDBs, and Amazon EMR using Massively Parallel Processing. But Redshift doesn’t support parallel loading from other sources. If you’re working with other data sources, you’ll need to use an ETL solution, JDBZC inserts, or scripts to load data.
- Uniqueness Not Enforced — Redshift doesn’t offer a way to enforce uniqueness on inserted data. So if you have a distributed system that writes data on Redshift, you’ll need to handle the uniqueness yourself either by using some method of data de-duplication or on the application layer.
Amazon Redshift is a powerful solution for data warehousing, and after reading this blog post you should have a better understanding of some of the pros and cons of Redshift. It has some limitations, but it is still our preferred data warehousing solution.