What is a Data Warehouse?
TL;DR: It’s a means for an organization to store and retrieve curated business decision-making information quickly, accurately, and with a high level of confidence, usually for reporting or visualization. Its pre-defined structure allows for cross-departmental analysis of business data that is usually pre-aggregated with business logic and rules already applied, and its security-provisioned profiles enable a more governed approach to scaling and enabling self-service analytics across the organization. Lastly, it makes any analysis a team already does in an Excel spreadsheet or elsewhere dramatically faster and more consistent. A Data Warehouse is commonly referred to as the “single source of truth” for the organization. We recommend that it always be part of an organization’s data-value-chain, regardless of the organization’s understanding of a ‘Data Lake’. An “Enterprise Data Warehouse” (EDW) is a common term, and the highest level of achievement, for most Data Warehouse implementations, as it implies a holistic representation of the organization’s information.
Why have a Data Warehouse?
In our opinion, and as a fundamental approach to building a DataLakeHouse, every organization should have a Data Warehouse as part of its data-value-chain, regardless of its idea or understanding of a ‘Data Lake’. An (Enterprise) Data Warehouse enables:
- A single-source-of-the-truth for operational and managerial information and data, with curated perspectives on the data that represent the organization, including departmental-owner perspectives
- Well-defined Key Performance Indicators (KPIs) that drive, and provide the pulse of, the business, tracking how the organization is doing and enabling alerting, thresholds, and other measurement abilities
- Corporate and business hierarchy representation of information – this can be a direct abstraction from operational systems (i.e.: ERP, CRM, POS) to codify and provide a taxonomy that represents the organization rather than the vendor’s default representation of the transactional data. Common examples are org structure, alternative charts of accounts or P&L structures, alternative business calendars, etc.
- Historical data storage and trend analytics (how is the organization performing over time?)
- Master Data Management (MDM) integration and validation, even potentially a “poor-man’s” MDM solution (depending on the organization’s business case requirements)
- Consolidation of multiple data source systems into a single data repository for analytics (for example, multiple GL systems and/or HR systems) to gain a single-source holistic view of the organization (perfect for acquisitions or mergers where decisions need to be made regarding the newly formed organization as a whole)
What Questions Does a Data Warehouse Answer?
Since the Data Warehouse is structured based on common principles across industries, yet customized completely to each individual organization’s business case needs, the answers a Data Warehouse can provide are nearly endless.
Examples of common answers a Data Warehouse seeks to provide are:
- What are our Top 20 Products?
- Who are our Top 10 Customers (Across geography, by product line)?
- How is our company performing by KPI Q-o-Q?
- What region are we declining by KPI?
- What product/menu item, etc. do our customers buy the most of between 2 and 6 pm, across the globe?
- What is the percentage of late arriving trucks this year?
- What is the average price of a product sourced from our supply chain vendors in China?
- What is the Net Cost of our manufactured product/part?
- What is the theoretical cost of a menu item?
- How much inventory is available to ship this week and forecasted to ship next month based on current demand?
These types of answers are possible mainly because of the schema (or structure) that is defined as the general framework, along with the business use cases, for the Data Warehouse. Often that is a combination of top-down and bottom-up leadership for an organization’s data/information/analytics program. Some business cases are general across industries and some are specific to each industry, and adding custom business cases for structuring the Data Warehouse for the unique organization/business is always part of the process of building such a solution.
Data Warehouse Structure and Architecture
A Data Warehouse truly does not need to be complex to deliver basic value from an organization’s data. Wisely built, with a solid foundation, it can scale to incorporate structure that provides analysis to all parts of the organization over time. By starting with a subset of the organization’s data, usually by selecting a single department, a Data Warehouse foundation can be built fairly quickly (with the ultimate state in mind) and provide business value sooner rather than later, while keeping an eye on the big picture and enabling other departments’ data to be turned into insightful and consistent information.
This structure is often referred to as a dimensional model. Building the dimensional model requires understanding the business use case requirements. Once the dimensional model is built, data which is usually normalized is taken through a technical data movement and cleansing process typically referred to as Extract, Transform, and Load (ETL). Variations such as Extract, Load, and Transform (ELT) are modern twists on the same general idea. Either way, ETL or ELT really just describes a data pipeline concept for formatting, cleansing, creating relationships in, and moving data from one point (i.e.: source) to a destination point (i.e.: target).
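As a minimal sketch of the pipeline concept described above, the following Python example extracts rows from a source table, cleanses them, and loads them into a target table. All table and column names here are hypothetical, and in-memory SQLite stands in for the real source and target systems.

```python
import sqlite3

# Hypothetical source: an in-memory OLTP-style table of raw orders.
src = sqlite3.connect(":memory:")
src.execute("CREATE TABLE orders (id INTEGER, amount TEXT, region TEXT)")
src.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                [(1, "100.50", " east "), (2, "200.00", "WEST"), (3, None, "east")])

# Extract: pull the raw rows from the source system.
rows = src.execute("SELECT id, amount, region FROM orders").fetchall()

# Transform: cleanse and standardize (drop incomplete rows, cast text
# amounts to numbers, trim and case-fold the region codes).
clean = [(i, float(a), r.strip().lower()) for i, a, r in rows if a is not None]

# Load: write the cleansed rows into the target (warehouse) table.
tgt = sqlite3.connect(":memory:")
tgt.execute("CREATE TABLE fact_orders (id INTEGER, amount REAL, region TEXT)")
tgt.executemany("INSERT INTO fact_orders VALUES (?, ?, ?)", clean)

totals = tgt.execute(
    "SELECT region, SUM(amount) FROM fact_orders GROUP BY region").fetchall()
print(totals)
```

An ELT pipeline would simply reorder these steps: load the raw rows into the target first, then run the cleansing transformations inside the warehouse itself.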
The ETL’d data is then loaded into the Data Warehouse dimensional model structure, usually in a batch process. Batch processing refers to the amount of data and the timeliness of its movement from a source system to the target system (i.e.: the EDW). Faster compute power, such as that of the Cloud, has enabled a Data Warehouse to be loaded in micro-batch, and even real-time (i.e.: near real-time), increments.
Common Data Warehouse Schema Types
A few very common structures for building a Data Warehouse and storing information for an organization’s information workers (i.e.: anyone in the organization who uses data) are Star Schemas and Snowflake Schemas. A few frameworks for building Data Warehouses exist as well, such as Data Vault, among others. Generally a star schema is a denormalized model which seeks to categorize the data so that it can be more readily aggregated, summarized, and referenced. This denormalization process, whether virtual or physical, real-time or batch, is what makes the Data Warehouse so special and allows for consistency and reliability in generating answers to business questions.
A star schema takes the attributes of the business data and stores them as perspectives of the data, called Dimensions. Common attributes would be: Time, Currency, Customer, etc. These are the perspectives, or points of view, for slicing and dicing the data so a particular answer can be identified. Dimensions help answer what you want to see the data “by”. For example, “Show me new customer sales, by region, by product, by month”. In this example, region, product, and month would all be Dimensions in the Data Warehouse star schema structure that help answer the question and show the information.
Complementing the attributes are metrics (also known as ‘measures’), which identify how your organization measures the business. These metrics, which are the amounts, counts, and averages, can come directly from the source systems or be derived. The metrics are typically stored in the Data Warehouse structure as “Facts”, and thus the tables where these metrics/facts reference their relationships to the attributes stored in the dimensions are referred to as Fact tables. This separation of the attributes (i.e.: dimensions) from the metrics (i.e.: facts), combined in a variety of ways through the automation of ETL based on the business rules/logic, is what gives a Data Warehouse its power for storing, analyzing, and having confidence in the data.
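To make the fact-and-dimension idea concrete, here is a toy star schema in Python using in-memory SQLite. The table names (`fact_sales`, `dim_product`, `dim_region`) and the sample data are invented for illustration; the point is the shape of the “by region, by product, by month” question as a query.

```python
import sqlite3

con = sqlite3.connect(":memory:")
# Hypothetical star schema: one fact table keyed to two dimension tables.
con.executescript("""
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, product_name TEXT);
CREATE TABLE dim_region  (region_id  INTEGER PRIMARY KEY, region_name  TEXT);
CREATE TABLE fact_sales  (product_id INTEGER, region_id INTEGER,
                          sale_month TEXT, sales_amount REAL);

INSERT INTO dim_product VALUES (1, 'Widget'), (2, 'Gadget');
INSERT INTO dim_region  VALUES (10, 'East'), (20, 'West');
INSERT INTO fact_sales  VALUES
  (1, 10, '2024-01', 500.0), (1, 10, '2024-01', 250.0),
  (2, 20, '2024-02', 300.0);
""")

# "Show me sales, by region, by product, by month": join the fact table
# to its dimensions and aggregate the measure by each perspective.
rows = con.execute("""
    SELECT r.region_name, p.product_name, f.sale_month, SUM(f.sales_amount)
    FROM fact_sales f
    JOIN dim_product p ON p.product_id = f.product_id
    JOIN dim_region  r ON r.region_id  = f.region_id
    GROUP BY r.region_name, p.product_name, f.sale_month
""").fetchall()
print(rows)
```

Each “by” in the business question becomes a join to a dimension table plus a `GROUP BY` column, while the fact/measure (`sales_amount`) is what gets aggregated.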
The ideas of dimensions and facts are common concepts almost regardless of the framework applied. Another schema type is known as a snowflake schema, named after how the model may look in an entity relationship diagram, resembling a snowflake pattern. It is very similar to a star schema; the main difference is that more normalization is part of how the structure is built. Historically this was seen as a potential performance issue, but with modern compute power just about any performance degradation can be minimized, so the real difference boils down to maintenance and the number of structures to align with the business use case. Optimization of a data warehouse is a case-by-case situation, but best practices in initial design should always be taken into account, regardless of schema.
What about Cubes and OLAP?
Since most operational (ERP, GL, POS, CRM) and transactional database systems are referred to as OLTP (Online Transaction Processing) systems, Data Warehouses and systems that typically aggregate data for purposes of analytics or reporting are referred to as OLAP (Online Analytical Processing) systems.
As a side note we believe that a ‘Data Lake’ and similar systems for object storage referencing should be referred to as OLSP (Online Storage Processing) systems.
Cube technology, also known as Multidimensional Cubes or multidimensional models, is an extension of OLAP technology and has been instrumental to broader analytics use cases. We believe one of the most instrumental aspects of cubes historically was the use of MDX (Multidimensional Expressions), a language similar to SQL for relational databases but used specifically for cube technology, because a cube engine is capable of looking at the relationships in data differently than a traditional relational database, which is the most common database system for a Data Warehouse. Again, with modern technology enhancements such as columnar databases, one could argue that data warehouses built today can perform just as well as or better than cube technology in most use cases.
How Much Maintenance Does a Data Warehouse Need?
The data-value-chain is part of a larger organization program made up of many projects. Projects may have a start and stop date, while the program is ongoing for the life of the organization. As time goes on more data is collected, more decisions are made, more insights and lessons are learned, and more market advantages are required to keep your organization competitive.
Once the Data Warehouse is built, it can be considered a completed version (ex: EDW v1); a company asset from its investment. As with any project, the goal is that the organization benefits from the investment. So, the more the EDW is used, socialized, and shows value, the more consideration it earns for performance improvements, SLAs, and expansion into other departments.
Most organizations will either have a small team internally to support the build and maintenance of the Data Warehouse, or a vendor that can provide support for these activities on a project-by-project basis, at fractional capacity, etc.
What is the Average Time to Build a Data Warehouse?
Any technical or business consultant worth their salt will tell you that “it depends”. The reason is that building a Data Warehouse, a Data Lake, or any other part of your data-value-chain takes time, and the time it takes is based on your business requirements and expectations.
The response to this question should not be different from:
- How long will it take to implement our new General Ledger?
- How long will it take to convert our legacy CRM to the new one?
- How long will it take to add a procurement system?
- How long will it take to close our books after merging all of our acquired companies?
Building a Data Warehouse from scratch for one department could take several weeks to a few months on average depending on the complexity. DataLakeHouse seeks to reduce this time exponentially by providing pre-built Data Warehouse solutions that fit across industry business cases.
Below is further explanation regarding a Data Warehouse though we feel most of the key detail is above. Please consider reading other commonly asked questions regarding underlying concepts which DataLakeHouse builds upon:
A Data Warehouse is a technical solution to answer functional business questions and also to provide a quick aggregate view of information based on a curated perspective.
The details pertaining to a Data Warehouse’s purpose, design, physical (meaning electronic footprint) structure, and means of delivery for consumption are subjective by nature as most organizations differ in purpose and objective. There are several truths for a Data Warehouse in that (1) it stores data in a denormalized structure (2) it stores historical data in a ready-to-use for analytical purposes form (3) it uses pre-defined dimensionality and factual metrics.
Two of the most prevalent structural frameworks for a Data Warehouse are the Dimensional Model and the Data Vault. Although different schools of thought on which Data Warehouse framework is better do exist, and there are zealots on both sides, the ultimate goal of delivering a solution that answers functional business questions remains the impetus. Maintenance of the Data Warehouse solution after its initial release and into the solution’s lifespan could sway a data engineer toward one framework or the other, depending on their inclination.
In contrast to a Data Warehouse, which is known as an On-Line Analytical Processing (OLAP) solution, is the more traditional (or at least longer-existing and potentially more prevalent) On-Line Transactional Processing (OLTP) system. An OLTP system typically has the more normalized design and is behind many e-commerce systems as well as Enterprise Resource Planning (ERP) and Customer Relationship Management (CRM) systems.
OLTP databases are often relational databases that store information and process transactions for ecommerce sites, software as a service (SaaS) applications, or games. OLTP databases are usually optimized for transactions, which require the ACID properties: atomicity, consistency, isolation, and durability, and typically have highly normalized schemas. In contrast, data warehouses tend to be optimized for data retrieval and analysis, rather than transactions, and typically feature denormalized schemas. Generally, denormalizing data from an OLTP database makes it more useful for analysis.
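The denormalization step mentioned above can be sketched in a few lines of Python with in-memory SQLite. The normalized `customers`/`orders` tables and the flattened `orders_denorm` table are hypothetical names for illustration.

```python
import sqlite3

con = sqlite3.connect(":memory:")
# Hypothetical normalized OLTP tables: customer attributes live in
# their own table, referenced by key from the orders table.
con.executescript("""
CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT, city TEXT);
CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL);
INSERT INTO customers VALUES (1, 'Acme Co', 'Reno'), (2, 'Globex', 'Miami');
INSERT INTO orders VALUES (100, 1, 50.0), (101, 1, 75.0), (102, 2, 20.0);
""")

# Denormalize for analysis: flatten the join into one wide table so
# analytical queries need no joins at read time.
con.execute("""
CREATE TABLE orders_denorm AS
SELECT o.order_id, o.total, c.name AS customer_name, c.city
FROM orders o JOIN customers c ON c.customer_id = o.customer_id
""")

rows = con.execute(
    "SELECT city, SUM(total) FROM orders_denorm GROUP BY city").fetchall()
print(rows)
```

The trade-off is the classic OLTP/OLAP split: the normalized form is better for writing transactions without duplication, while the wide form is better for reading and aggregating.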
A Data Warehouse is the result of a process of integrating data from multiple heterogeneous (separate and potentially un-linked and un-related) sources to meet business requirements, typically to create a market advantage, or simply to answer questions about the operational aspects of the business which cannot easily be gleaned from a single ERP, CRM, etc. system. The Data Warehouse, and this combined and aggregated look across the business’s data, allows for structured and/or ad hoc queries and decision making previously not possible.
The structure of the Data Warehouse aims to have both dimensions and facts/measures.
Facts: A fact is a measure that can have some mathematical operation applied to it across the many ways to look at the business. Typically, several business rules go into developing a fact measure so that it is meaningful to the business and answers functional questions. Facts are mainly sliced by the different dimensions of the business.
Dimensions: A dimension is easily communicated as a perspective of the business or organization. For example: business unit, cost center, employee, product, customer, etc. These are all different aspects of the business which allow questions to be asked ‘by’ some aspect of the business, using a fact measure as the unit to be calculated. For example, let’s see Sales Order Dollars in Nevada ‘by’ Products shipped ‘by’ Carriers in California. In this case the Nevada (State dimension), Products (Product dimension), Carriers (Carrier dimension), and California (State dimension) tables are structures of the Data Warehouse which contain the textual descriptors of the business. The fields of a dimension table are designed to satisfy these two important requirements:
- Query constraining / grouping / filtering.
- Report labeling
As another example, if we were to examine “Sales Revenue by Product”, we would most probably see this structure: Sales Revenue (Fact) and Product (Dimension).
Schemas in Data Warehousing
A schema is a logical description of the entire database. It includes the name and description of records of all record types, including all associated data items and aggregates. Like a database, a data warehouse also requires maintaining a schema. A database uses the relational model, while a data warehouse uses Star and Snowflake schemas.
A schema model is really a logical description of the database or data warehouse. It will contain the names, descriptions, columns, and data types for the structure, including placeholders for attributes and metrics. So, a data warehouse requires some level of design and curation in order for it to apply to one or more fairly specific business use cases. Oftentimes people associate a data warehouse model with a denormalized structure. This differs from a typical operational database design in 3rd normal form, which is normalized so that it reduces repetition and provides a more object-oriented structure. Data warehouse models are typically classified as dimensional models or data vault models. Either way the model will consist of fact tables and dimensions, and these models are also depicted, or otherwise known, as star or snowflake dimensional models.
Star Schema: In a Star Schema, each dimension is represented by a single dimension table. Each dimension table joins directly to the Fact Table in a star-like structure.
Snowflake Schema: Unlike a Star Schema, the dimension tables in a Snowflake Schema are normalized. The normalization splits the data up into additional tables.
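The contrast between the two schema types can be shown side by side. In this sketch (with invented table names), the star version keeps the product dimension as one flat table, while the snowflake version normalizes the category out into its own table, which then costs an extra join at read time.

```python
import sqlite3

con = sqlite3.connect(":memory:")
# Star: the product dimension is one denormalized table, category included.
con.execute("""CREATE TABLE dim_product_star (
    product_id INTEGER PRIMARY KEY, product_name TEXT, category_name TEXT)""")

# Snowflake: the same dimension is normalized, splitting category into
# its own table referenced by a foreign key.
con.executescript("""
CREATE TABLE dim_category (category_id INTEGER PRIMARY KEY, category_name TEXT);
CREATE TABLE dim_product_snow (
    product_id INTEGER PRIMARY KEY, product_name TEXT,
    category_id INTEGER REFERENCES dim_category(category_id));
INSERT INTO dim_category VALUES (1, 'Hardware');
INSERT INTO dim_product_snow VALUES (10, 'Widget', 1);
""")

# Reading the snowflake dimension takes an extra join to recover the
# flat, star-style view of the product and its category.
row = con.execute("""
    SELECT p.product_name, c.category_name
    FROM dim_product_snow p JOIN dim_category c USING (category_id)
""").fetchone()
print(row)
```

Both shapes answer the same questions; the snowflake trades read-time joins for less duplication and easier maintenance of the normalized attributes.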
There’s lots more detail and explanation of what a Data Warehouse is and what it can do. Different schools of thought create endless debates on structure, size, scale, and segmentation of a Data Warehouse. It will always have a place in every organization where actionable insights and decisions need to be made, and it dovetails nicely with all concepts of Data Lake storage, Machine Learning, and Analytics in the business, regardless of where you’re starting or how big your Big Data actually becomes.