Speed: Cloud-based data warehouse architecture is substantially speedier than on-premises options,
partly due to the use of ELT — which is an uncommon process for on-premises counterparts.
Flexibility: Cloud data warehouses are designed to account for the variety of formats and structures found in big data. Traditional relational options are designed simply to integrate similarly structured data.
Scale: The elastic resources of the cloud make it ideal for the scale required of big datasets. Additionally, cloud-based data warehousing options can also scale down as needed, which is difficult to do with other approaches.
Some of the more notable cloud data warehouses in the market include Amazon Redshift, Google BigQuery, Snowflake, and Microsoft Azure SQL Data Warehouse.
Amazon Redshift architecture overview
Redshift leverages the massively parallel processing architecture in which nodes store data in slices via a columnar format. Each node has their own storage, RAM, and compute power. The main types of nodes are leader and compute nodes; the former intakes queries and assigns them to compute nodes to perform the queries.
Because compute nodes can process data in different slices at the same time, Redshift has robust query performance. Compute nodes return the results to leader nodes, which aggregate them for client-side applications. Users can connect directly to Redshift with an assortment of BI or analytics tools to query the data directly where it lives.
Google BigQuery architecture overview
Google BigQuery relies on a serverless architecture in which different machines are used by the provider to manage resources. Consequently, clients aren’t involved in the resource management process. BigQuery’s architecture supports both traditional data loading and data streaming, the latter of which is designed for ingesting data in real-time.
The main architectural component for this cloud data warehouse is Dremel, a massively parallel query engine capable of reading hundreds of millions of rows in seconds. With this approach, data is actually stored in a file management system called Colossus that puts the data in clusters made up of different nodes. The files are distributed in 64 megabyte amounts in a columnar format. Queries are issued from a tree architecture among the different machines the data is stored in, helping with the quick response times.
Snowflake data warehouse architecture
Snowflake’s architecture is similar to Redshift because it also uses a cluster and node approach. The main architectural difference with Snowflake is that the compute capabilities are separate from the storage, producing a few important advantages. The main benefit? The storage location changes depending on whether or not users require computing at the moment.
If the user doesn’t need computation, the data is tiered (meaning moved to) another storage area that is less costly, since that storage area is not used for data computation. Also, separating storage from compute enables Redshift’s architecture to easily scale up and down as needed, well beyond the capacity of on-premises data warehouses. Additionally, the components for data ingestion and analysis are integrated with the storage component.
Microsoft Azure SQL data warehouse architecture
Similar to Redshift’s architecture, the cloud based data warehouse architecture of Microsoft Azure is based on massively parallel processing. Data is stored in relational databases that, because of this architecture, run fast SQL queries of enormous complexity. Additional tools in the Azure cloud ecosystem allow users to create automated pipelines for transmitting, processing and storing data at petabyte scale. |