What are Snowflake Warehouses and What are their Advantages
Creating a data warehouse in the past often required you to buy expensive external hardware to support your data center. However, the launch of Snowflake has eliminated the problem. There is no longer a need for separate data marts, lakes, and warehouses. You can also securely share data across an enterprise. And that’s not all, as there are several more benefits of Snowflake warehouses. Let’s understand them in detail through this article.
What are snowflake warehouses?
Snowflake is a cloud-based data warehouse built on different platforms, like Amazon Web Services (AWS), Microsoft Azure cloud infrastructure, Google cloud platform (GCP). There is no need for any software or hardware to configure, select, install or manage. It has made Snowflake ideal for those organizations that don’t want to dedicate resources for setup, maintenance, and support of servers. Any data can be moved easily into snowflakes using different ETL tools.
To understand snowflake warehouses, it is essential to consider the following aspects:
- The architecture of snowflakes.
- How virtual warehouses work in snowflake architecture.
- Size of Warehouses.
- How data is loaded using warehouses in snowflakes.
- How a query is processed in snowflake.
- What are snowflake partners?
Architecture of Snowflake
Snowflake architecture consists of three layers. These are Cloud services, Virtual warehouse/Query processing, Database Storage. They remain physically independent and logically dependent on each other.
- Cloud services layer
This layer contains a collection of services that helps perform tasks and activities across snowflakes. The cloud service layer ties different components together to process the queries. This layer also provides security like access control. It is also the brain of the snowflake.
- Query processing layer
This layer comprises virtual warehouses responsible for executing data processing tasks across the snowflake. Each virtual warehouse contains clusters, accesses data in the storage layer, and works independently.
- Data storage layer
This layer is responsible for storing data and query results. The data gets stored in an optimized, columnar, and compressed format. Then snowflake stores this optimized data in cloud storage. The data objects stored in snowflake are not visible nor directly accessible to users. They are accessible through SQL queries that run using snowflake.
Snowflake Virtual Warehouse
We have seen the architecture of snowflakes in the second layer of the snowflake (Query processing layer) containing virtual warehouses.
A virtual warehouse in snowflake contains a cluster of database servers deployed on-demand to show the result required by the user. On a traditional on-premises database, this is MPP (Massively Parallel Processing). It needs fixed hardware for deployment.
However, on the snowflake, a virtual warehouse has clusters of database servers that consist of CPU cores and memory. They are maintained in the hardware pool and deployed in just milliseconds. The minimum size of the warehouse is XSMALL and the maximum is 5X-LARGE.
While executing any DML queries in snowflake, a warehouse must be specified as the current warehouse. It should run for the session in which the query statement is running. A snowflake can have only one current warehouse at a time. Once you set any warehouse to your current session, queries that you submit in that session get processed by the snowflake warehouse.
Creating Virtual Warehouse
Creating a warehouse in snowflake is achieved with create command and some optional parameters or with a web interface.
CREATE [OR REPLACE] WAREHOUSE [IF NOT EXIST] <name> [ [WITH] object properties ]
WHEREHOUSE_SIZE= XSAMLL | SMALL | MEDIUM | LARGE | XLARGE | XXLARGE | XXXLARGE | X4LARGE | X5LARGE | X6LARGE MAX_CLUSTER_COUNT = <num> MIN_CLUSTER_COUNT = <num> SCALING_POLICY = STANDARD | ECONOMY AUTO_SUSPEND = <num> | NULL AUTO_RESUME = TRUE | FALSE INITIALLY _SUSPENDED = TRUE | FALSE RESOURCE_MONITOR = <monitor_name> COMMENT = ‘<string_literal>’
CREATE OR REPLACE WAREHOUSE WH1 WITH WAREHOUSE_SIZE = ‘X-LARGE’ MAX_CLUSTER_COUNT = 2 MIN_CLUSTER_COUNT = 1;
Considerations to keep in mind while working with snowflake warehouses
- Loading Data
Loading data in snowflake tables or stages requires a virtual warehouse. A virtual warehouse is another term for a compute cluster that provides the modern data warehouse. You can leverage compute resources at any time for SQL execution and DML (Data Manipulation Language). Data gets loaded into snowflakes from one or more sources. It thus requires more than one server to process the data.
Data can continuously come to Snowflake from sources like Facebook. The load gets handled by multiple servers. This problem is solved using the multi-cluster warehouse, which contains minimum and maximum clusters in Snowflake. According to the need for data load, the virtual warehouses’ clusters increase and decrease.
The maximum cluster size supported by Snowflake is ten, and the minimum cluster size is one. Once the data load gets increased while loading, Snowflake can automatically increase the size of the cluster to handle the load.
- Scaling up and scaling out
Resizing a warehouse increases the query performance for large and complex queries. It reduces the queuing if a warehouse does not have enough compute resources required to process queries fired on data concurrently.
- Auto suspension
You can set the warehouse to suspend when there is no use of the warehouse in current activity after the specified time. Auto suspension happens by listing time in minutes, hours, etc.
Snowflake utilizes billing per second. So it is best to specify the auto-suspend time to a low value like five or ten minutes. It helps consume fewer credits when a warehouse is not in use. However, the value you set must match the gap, if any, in your workload. If you have very few gaps, like two minutes between queries, you do not need to set the auto suspend.
Snowflake warehouse can also be set to auto-resume when it is inactive or auto-suspension mode when it again gets resumed upon the submission of new queries. You have to note that a short time is essential in the auto-resumption of snowflake warehouses because of provisioning.
- Warehouse size
Warehouse size is nothing but a set of computing resources available in a warehouse. Snowflake supports the different sizes of a warehouse listed in the following table.
- Query processing
Query execution happens in the under-processing layer. When a snowflake processes a query, it uses virtual warehouses. Each virtual warehouse consists of an independent cluster that doesn’t share compute resources with other warehouses. As a result, each warehouse does not impact the other virtual house performance.
- Warehouses for client Utilities/Drivers/Connectors
In addition to the default warehouses for the user, any snowflake client (Python connector, ODBC Driver, SnowSQL, JDBC Driver, etc.) can have default warehouses. Snowflake supports command line and configuration files for specifying the default warehouse. The external connectors and drivers support a default warehouse as a connection parameter when initiating the session.
- Snowflake Partner
Snowflake supports different partners to connect warehouses and query the data which are mainly divided into five types.
- Data Integration.
- Business intelligence.
- SQL Dev & Management.
- Security & Governance.
- ML & Data Science.