Data warehouse
A data warehouse is a centralized place where data from many different sources can be stored.
An ETL (extract, transform, load) model extracts data from the sources, transforms it into the warehouse's format, and then loads it into the warehouse.
An ELT (extract, load, transform) model first loads the data into the warehouse and transforms it after it has been loaded.
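The difference between the two models is only the order of the steps. A minimal sketch, using a plain Python list as a toy "warehouse" and dicts as rows (all names here are illustrative, not a real library):

```python
# Toy contrast of ETL vs ELT: same steps, different order.

def transform(row):
    # Example transformation: normalize names to lowercase.
    return {"name": row["name"].lower(), "amount": row["amount"]}

def etl(source_rows, warehouse):
    # ETL: transform each row BEFORE loading it into the warehouse.
    for row in source_rows:
        warehouse.append(transform(row))

def elt(source_rows, warehouse):
    # ELT: load the raw rows first, then transform inside the warehouse.
    warehouse.extend(source_rows)
    warehouse[:] = [transform(row) for row in warehouse]

source = [{"name": "Alice", "amount": 10}, {"name": "BOB", "amount": 20}]
wh_etl, wh_elt = [], []
etl(source, wh_etl)
elt(source, wh_elt)
# Both models end with the same transformed rows in the warehouse.
```

In practice ELT is popular when the warehouse itself has enough computing power to run the transformations.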
Why it's useful
A data warehouse takes effort to create and maintain alongside the source storage, but it has several advantages:
- It lets you view data from multiple sources in one place
- It doesn't slow down the source storage while running queries
- It lets you check data history
As opposed to a data mart, a data warehouse stores data for the whole company, not just for a single topic in the company.
Attribute | Data warehouse | Data mart
---|---|---
Scope of the data | enterprise-wide | department-wide
Number of subject areas | multiple | single
How difficult to build | difficult | easy
How long it takes to build | more time | less time
Amount of storage | larger | limited
Information storage
Facts
A fact is a value or measurement, such as the number of units sold.
Dimensional versus normalized approach for storage of data
There are several approaches to storing data in a data warehouse; the two most important are the dimensional approach and the normalized approach.
Dimensional approach
A fact is a concrete piece of data. Dimensions are the metadata about the fact.
Separation between data (facts) and contextual metadata (dimensions) makes the data warehouse simpler.
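This separation is often drawn as a star schema: a central fact table whose rows hold only measurements plus keys, pointing at dimension tables that hold the descriptive metadata. A minimal in-memory sketch, with all table and field names chosen for illustration:

```python
# Dimension tables: descriptive metadata, keyed by a surrogate key.
dim_product = {1: {"name": "widget", "category": "tools"}}
dim_date = {20240101: {"year": 2024, "month": 1, "day": 1}}

# Fact table: each row is a measurement plus keys into the dimensions.
fact_sales = [
    {"product_key": 1, "date_key": 20240101, "amount": 9.99},
]

def describe_sale(fact):
    # Join a fact row to its dimensions to get a readable record.
    product = dim_product[fact["product_key"]]
    date = dim_date[fact["date_key"]]
    return (f'{product["name"]} sold for {fact["amount"]} '
            f'on {date["year"]}-{date["month"]:02d}-{date["day"]:02d}')

print(describe_sale(fact_sales[0]))  # widget sold for 9.99 on 2024-01-01
```

Keeping the measurements in one narrow fact table and the context in small dimension tables is what makes this approach easy to understand and fast to query.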
Some disadvantages of the dimensional approach are the following:
- When loading data from different systems some dimensions may be missing.
- If the organization changes its business, both facts and dimensions may change. Moving to the new structure requires either supporting reads from both the old and new structures or converting the old structure to the new one.
Normalized approach
In the normalized approach, the data in the data warehouse are stored partially according to database normalization rules. Normalization removes data duplication by splitting each piece of data into entities. Each entity (e.g. customer, product, sale) relates to other entities. Reading a piece of data requires joining these entities together, which makes reads slower.
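The read cost can be sketched with the three entities mentioned above, using in-memory dicts as tables (all names are illustrative); each key lookup stands in for a SQL JOIN:

```python
# Normalized entity tables: each piece of data lives in exactly one place.
customers = {1: {"name": "Alice"}}
products = {10: {"name": "widget", "price": 9.99}}
sales = [{"customer_id": 1, "product_id": 10, "qty": 2}]

def read_sale(sale):
    # Reading one sale needs joins to both the customer and the product.
    customer = customers[sale["customer_id"]]
    product = products[sale["product_id"]]
    return {
        "customer": customer["name"],
        "product": product["name"],
        "total": product["price"] * sale["qty"],
    }

record = read_sale(sales[0])
```

The upside of paying for these joins is that updating a product's price, for example, means changing one row in one table instead of every sale record that mentions it.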