Definition

snowflaking (snowflake schema)

Robert Sheldon

By

Robert Sheldon

What is snowflaking (snowflake schema) in warehousing?

In data warehousing, snowflaking is a form of dimensional modeling in which dimensions are stored in multiple related dimension tables. A snowflake schema is a variation of the star schema that normalizes the dimension tables to increase data integrity, simplify data maintenance and reduce the amount of disk space. In certain situations, it can also improve query performance.

A star schema consists of a central fact table that references multiple dimension tables. Each dimension table is denormalized ("flattened") to avoid the query overhead that comes with a highly normalized schema, which can require a large number of joins to retrieve the necessary data.

Figure 1 shows a basic example of a star schema. It includes one fact table (green) and four dimension tables (blue). The fact table contains multiple foreign keys that reference the dimension tables. The data in each dimension table is denormalized, making queries fast and efficient.

Diagram illustrating basic star schema. — Figure 1. A basic example of star schema in data warehousing.

Normalized dimension tables can contain a significant amount of redundant data. For instance, the dimTerritory table includes the TerritoryName, TerritoryCountry and TerritoryRegion columns. Together, these columns form the hierarchy TerritoryRegion > TerritoryCountry >TerritoryName. An example of this might be North America > Mexico > Baja California.

Columns such as TerritoryCountry and TerritoryName can have a low cardinality, which refers to the number of unique values relative to the number of rows. The more rows, the lower the cardinality and the more redundant data. Although data queries are generally faster in a star schema, the schema can also be more prone to data integrity issues and require more disk space than a more normalized structure.

In a snowflake schema, a fact is surrounded by its associated dimensions (as in a star schema), but those dimensions are further related to other dimensions, branching out into a snowflake pattern. Snowflaking normalizes the dimensions by moving attributes with low cardinality into separate dimension tables.

The fact table in a snowflake schema uses foreign keys to reference the core dimensions, which in turn use foreign keys to reference the dimensions at the next level. The illustration in Figure 2 shows a snowflake schema that was created from the star schema shown in Figure 1, with each of the original dimensions now normalized.

Diagram illustrating basic snowflake schema. — Figure 2. A basic example of snowflake schema in data warehousing.

The fact table still links to the four core dimensions, but those dimensions have been normalized in the following ways:

The dimCustomer dimension no longer includes the CustomerInterest1, CustomerInterest2 and CustomerInterest3 columns. Instead, a separate dimInterestArea table has been created, with a bridge (junction) table added between the dimInterestArea table and the dimCustomer table. The bridge table makes it possible to support a many-to-many relationship in which one customer can be associated with multiple interest areas and an interest area can be associated with multiple customers. It also enables customers to be associated with more than three interests.
The ProductType and ProductSupplier columns in the dimProduct table have been replaced with key columns that reference the new dimProductType and dimSupplier tables, removing the redundant data from the dimProduct table.
The hierarchical data in dimTerritory table is now fully normalized. The table contains a foreign key that references the dimCountry table and the dimCountry table contains a foreign key that references the dimRegion table. The data is now spread across the three dimension tables, eliminating the redundancy in the original dimTerritory table.
The month and day data have been removed from the dimDate table and replaced with key columns that reference two new dimension tables, eliminating the many duplicated instances of the day and month names. The dimDate table can also be normalized in other ways, as warranted by the supported workloads and date range covered by the dimension.

Some sources consider a schema to be snowflaked if at least one of the dimensions is normalized, while other sources insist that all dimensions must be normalized. A schema that is only partially normalized is sometimes referred to as a starflake schema because it combines the characteristics of both the star and snowflake schemas.

What is the purpose of the snowflake schema?

The normalized dimensions in a snowflake schema reduce the amount of redundant data, making them less susceptible to data integrity issues than a star schema. Whenever multiple copies of the same data are stored in a database, as is the case with the star schema, there is a greater risk that extract, load and transform (ELT) operations will result in problems with the data.

With a snowflake schema, data maintenance is easier and less likely to cause data integrity issues. There is less redundant data, and that data is organized into separate tables, simplifying the processes of adding data and updating data. A snowflake schema also requires less disk space for data storage.

That said, a star schema is easier to set up than a snowflake schema. Developers can also build and update queries more easily, and those queries generally perform better because there are fewer joins. Many sources, including the Kimball Group, a data warehousing consultancy, generally recommend against the snowflake schema, although other sources, such as IBM, suggest that snowflaking is a viable alternative in some cases. However, snowflaking is rarely recommended simply to minimize disk space.

The snowflake schema might be used to support specific query needs, or it might be used when the data itself is not conducive to being easily denormalized. The decision to use a snowflake schema will often depend on the type of queries that will be supported and on the data being stored in the data warehouse. For instance, business intelligence (BI) applications that use a relational OLAP (ROLAP) architecture might perform better when the data warehouse schema is snowflaked.

Snowflake schema diagram. — Figure 3. Snowflake schema uses normalized data to organize data in a way that eliminates redundancy and helps reduce the amount of data.

Snowflaking should also be considered when evaluating the dimensions themselves. For example, you should consider using it in the following circumstances:

The dimension contains sparsely populated attributes whose values are mostly NULL.
The dimension supports a many-to-many relationship but limits the number of potential instances. For example, a data warehouse for a streaming video service might have to limit the number of categories that can be associated with each video if the dimension is denormalized.
The dimension is very large and full of redundant data in low cardinality attributes. For instance, a company's data warehouse might include a customer dimension that contains over a million rows of data, including a substantial amount of redundant geographical data.
The dimension includes low cardinality attributes that are queried independently. For example, a product dimension might contain thousands of records, but only a handful of product types. Moving the product type attribute to its own dimension table can improve performance when the product types are queried independently.
The dimension's attributes are part of a hierarchy and are queried independently, such as the year, quarter and month attributes of a date hierarchy or the country and state attributes of a geographic hierarchy.

Query performance will often drive the decision of whether to use a star schema or snowflake schema. However, data architects should still keep in mind other factors, such as data maintenance, data storage and the resources needed to develop and maintain the schema and queries.

Further explore the differences between star schema and snowflake schema and learn the differences between data lake vs. data warehouse.

This was last updated in July 2023

Continue Reading About snowflaking (snowflake schema)

Modernizing a data warehouse for real-time decisions

The differences between a data warehouse vs. data mart

On-premises vs. cloud data warehouses: Pros and cons

Weigh the benefits and drawbacks of a hybrid data warehouse

Best practices and pitfalls of the data pipeline process

Dig Deeper on Data integration

Business Analytics

Logi analytics suite to add new GenAI, SaaS capabilities
Insightsoftware, parent company of the embedded BI specialist, unveiled a new generative AI assistant and SaaS version of ...
Snowflake targets enterprise AI with launch of Arctic LLM
The data cloud vendor's open source LLM was designed to excel at business-specific tasks, such as generating code and following ...
AI-fueled efficiency a focus for SAS analytics platform
The vendor's latest product development plans include an AI assistant and prebuilt AI models that enable workers to be more ...

AWS Control Tower aims to simplify multi-account management
Many organizations struggle to manage their vast collection of AWS accounts, but Control Tower can help. The service automates ...
Break down the Amazon EKS pricing model
There are several important variables within the Amazon EKS pricing model. Dig into the numbers to ensure you deploy the service ...
Compare EKS vs. self-managed Kubernetes on AWS
AWS users face a choice when deploying Kubernetes: run it themselves on EC2 or let Amazon do the heavy lifting with EKS. See ...

Content Management

7 SharePoint problems that spur customers to leave the platform
SharePoint is a well-known content management and collaboration platform. Despite its popularity, it can introduce many ...
5 benefits of enterprise search
With a proper enterprise search strategy in place, organizations can improve their employees' efficiency and ensure customers ...
OpenText expands GenAI for enterprise content, IoT
OpenText finds a novel use for generative AI: combing through, sorting and summarizing massive amounts of IoT data. It also ...

Oracle sets lofty national EHR goal with Cerner acquisition
With its Cerner acquisition, Oracle sets its sights on creating a national, anonymized patient database -- a road filled with ...
With Cerner, Oracle Cloud Infrastructure gets a boost
Oracle plans to acquire Cerner in a deal valued at about $30B. The second-largest EHR vendor in the U.S. could inject new life ...
Supreme Court sides with Google in Oracle API copyright suit
The Supreme Court ruled 6-2 that Java APIs used in Android phones are not subject to American copyright law, ending a ...

SAP earnings for Q1 indicate strong cloud growth
SAP's cloud revenue for the first quarter of 2024 indicates healthy growth and sets the stage as customers plan cloud migrations ...
SAP chief AI officer: Waiting on AI is the wrong strategy
SAP's first chief AI officer, Philipp Herzig, outlines the company's new AI-focused organization and underscores why companies ...
SAP, Nvidia partner to boost Business AI development
SAP and Nvidia are working together to combine platforms and services that help customers build business-specific generative AI ...

Close