What is a Data Lake?
TL;DR: Some controversy initially surrounded the purpose of a Data Lake. Most now agree that it is a repository, on-premises and/or in the cloud, used to store an organization’s own or third-party raw data. It accepts all data types from essentially any data source and stores them until the data is ready to be consumed. This kind of storage is usually referred to as ‘Object’ storage (and sometimes, more loosely, as ‘Block’ storage), because any data type stored (ex: CSV, TXT, MP4, Parquet, Avro) is classified generically as an object. A Data Lake is often compared to a Data Warehouse, likely because people early on conflated a general-purpose raw data storage location for data of any type with the Hadoop Distributed File System (HDFS), and with the sole purpose of analyzing Big Data or replacing a traditional Data Warehouse that could not scale to support the ever-growing datasets Hadoop proved able to process. A Data Lake is now often qualified with the suffix “storage”, distinguishing Data Lake Storage from other purported uses. There’s a good article on why Data Lake is short for ‘Data Lake Storage’.
If you think of a datamart as a store of bottled water – cleansed and packaged and structured for easy consumption – the data lake is a large body of water in a more natural state. The contents of the data lake stream in from a source to fill the lake, and various users of the lake can come to examine, dive in, or take samples.
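The idea that a Data Lake treats every file as a generic object can be illustrated with a minimal sketch. The `ToyObjectStore` class and the key names below are hypothetical and exist only for illustration; a real object storage service adds durability, security, and an API on top of the same basic put/get-by-key model.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ToyObjectStore:
    """Hypothetical in-memory stand-in for an object storage bucket."""

    _objects: Dict[str, bytes] = field(default_factory=dict)

    def put(self, key: str, data: bytes) -> None:
        # The store never inspects the payload: CSV, MP4, or plain text
        # are all kept as opaque bytes under a key.
        self._objects[key] = data

    def get(self, key: str) -> bytes:
        return self._objects[key]

    def list_keys(self, prefix: str = "") -> List[str]:
        # Keys that share a prefix behave like a folder, but there is no real hierarchy.
        return sorted(k for k in self._objects if k.startswith(prefix))


store = ToyObjectStore()
store.put("raw/sales/2024-01.csv", b"order_id,amount\n1,19.99\n")
store.put("raw/video/intro.mp4", b"\x00\x00\x00\x18ftypmp42")
store.put("raw/notes/readme.txt", "free-form text".encode("utf-8"))

print(store.list_keys("raw/"))
```

Whatever is stored comes back byte for byte; interpreting it is left to whoever consumes the data later.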
Why have a Data Lake?
Soon the notion of Big Data will just mean ‘Data’: ‘Big’ becomes the new normal, so it is simply data, it is everywhere, and it is mainly pedestrian until someone determines how to use it for some value. The flow of data is not going to stop anytime soon. There is data in every transaction. An organization can even purchase data to help achieve its goals, and data is now produced by objects once thought of as static and utilitarian. What hasn’t started for every organization is a way to capture, store, organize, protect, share, and produce value from that data. These are really the 6 key principles of a Data Lake. Here are some key reasons to have a Data Lake:
- Data is actually an asset, so start collecting and storing it now
- Your organization is thirsty for data, so provide a means for approved users to access the data in the way that suits them best
- Storage is inexpensive compared to any earlier period, so it may be an excellent way to offload data from legacy systems and reduce cost elsewhere
- It leads to other value-add initiatives, for example moving towards a cloud or hybrid-cloud architecture, or starting a newer conversation regarding cybersecurity in the organization, etc.
- It plants the seed for modernization of systems, ex: updating the legacy Data Warehouse or building a new one, or reorganizing legacy data pipelines or old ETL into ETL as Code
Some organizations had a general concept of a dumping ground where data rests for a good deal of time long before it became popular to call it a Data Lake. Some would argue that, without satisfying the 6 key principles, one only has a Data Swamp, that is, an unorganized, near-dysfunctional storage repository of data.
Remember, a Data Lake is really just the storage aspect: a place where any type of data can be stored. Different vendors that provide Data Lake Storage enable different capabilities for managing the object storage. Some add detailed search, object security, REST API capability, etc., while others add data wrangling capability. This adds to the confusion about where the Data Lake starts and ends. DataLakeHouse attempts to fill that gap by educating on the end-to-end lifecycle, from Data Lake to Business Value, and by providing a recognized separation of duties, which still confuses many.
Why the Confusion between Data Lakes and Data Warehouses?
Not every organization is as cutting edge as the household brands that those of us in the field of technology read about in the latest TechCrunch or Forbes articles. Companies like Uber, Airbnb, and Amazon must run fast and stay on the cutting edge to keep up with and leapfrog competitors. These companies are built on the principle of being leaders in the technology space. Although they solve their business problems with technology quickly and with unique solutions, other companies that are not household names struggle with many of the same issues, albeit potentially at lower data volumes.
So, many of those companies (not covered by TechCrunch and the like) are focused on running the business and may or may not even have a Data Warehouse, let alone a Hadoop instance or a Data Lake. When the buzzword Big Data spread across headlines and tech conferences during the 2010s, the most relevant connection to improving data analytics that most organizations had been familiar with for the previous 30 years was the Data Warehouse. Naturally, most people make inferences based on their current understanding, so unfortunately many conversations and articles that described a data lake as the place an organization can go for answers to its business questions (and all other questions) compared it to the historical notion of a data warehouse.
Another common comparison arises in the conversation about storing structured versus unstructured data. For someone coming only from the land of relational databases supporting operational systems, and potentially a data mart/warehouse, not knowing what the definition of unstructured data might be can cause confusion. This is similar to the argument of schema on read versus schema on write, which applies to a system that would consume data from a Data Lake (not to the Data Lake itself). These comparisons between the potential behind a Data Lake and what a Data Lake actually is, again, simply use the Data Warehouse as a reference point. The value of a Data Warehouse is still strongly intact.
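The schema on read versus schema on write distinction can be made concrete with a small sketch. The two-column schema, the sample data, and the function names below are assumptions made up for illustration, not part of any particular product.

```python
import csv
import io

# Raw text as it might sit in a Data Lake; the second row does not fit the schema.
RAW = "order_id,amount\n1,19.99\n2,n/a\n"


def parse_row(row):
    """Apply the (hypothetical) schema: order_id is an int, amount is a float."""
    return {"order_id": int(row["order_id"]), "amount": float(row["amount"])}


def write_with_schema(raw):
    """Schema on write: every row must fit the schema before anything is stored."""
    return [parse_row(r) for r in csv.DictReader(io.StringIO(raw))]


def read_with_schema(raw):
    """Schema on read: the raw text was stored untouched; the schema is applied
    only when a consumer reads it, and rows that do not fit are set aside."""
    good, rejected = [], []
    for r in csv.DictReader(io.StringIO(raw)):
        try:
            good.append(parse_row(r))
        except ValueError:
            rejected.append(r)
    return good, rejected


try:
    write_with_schema(RAW)
    write_ok = True
except ValueError:
    write_ok = False  # the whole load fails on the malformed row

good, rejected = read_with_schema(RAW)
print(write_ok, good, len(rejected))
```

In both cases the schema lives in the consuming system. The stored bytes are the same either way, which is why the distinction does not describe the Data Lake itself.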
For anyone who has worked with Object Storage solutions, and with the ingress and egress of data from Object Storage ‘buckets’, the difference between a Data Lake and a Data Warehouse is easy to understand. However, the comparison seems to have been further proliferated by writings about ‘Data Scientists’ and their need for data for machine learning, prediction, etc. that a Data Warehouse could not provide. Again, the comparison was made without qualifying the distinct purposes of the two types of systems.
Lastly, newer vendor systems labeled as Cloud Data Warehouses do an amazing job as a Data Warehouse solution and provide highly scalable computing power. Several of the Cloud Data Warehouse vendors have now combined Data Warehouse, Machine Learning, and direct Object Storage to Data Warehouse data loading (pseudo-ETL) within their solutions. This comparison, and even blending, of the Data Warehouse concept with what was historically a Data Lake can add confusion. The DataLakeHouse project seeks to educate on and clear up these common misunderstandings so that every organization can achieve business value through a Big Data ecosystem. This latter example is one the DataLakeHouse project separates into Front Lake and Back Lake concepts so that organizations can properly align skillsets and achieve maximum potential for their Data Management investments.
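As a rough illustration of direct Object Storage to Data Warehouse loading, the sketch below copies one raw CSV object into a warehouse table. It rests on loud assumptions: a plain dictionary stands in for an object storage bucket, an in-memory SQLite database stands in for a Cloud Data Warehouse, and the `orders` table and the function name are hypothetical. Actual vendors expose their own loading commands, which are not shown here.

```python
import csv
import io
import sqlite3

# Hypothetical bucket: object key -> raw bytes, as they would rest in the Data Lake.
bucket = {"raw/orders/2024-01.csv": b"order_id,amount\n1,19.99\n2,5.00\n"}


def load_object_into_orders(conn, bucket, key):
    """Read one raw object and load its rows into a typed warehouse table."""
    text = bucket[key].decode("utf-8")
    rows = [
        (int(r["order_id"]), float(r["amount"]))
        for r in csv.DictReader(io.StringIO(text))
    ]
    conn.execute("CREATE TABLE IF NOT EXISTS orders (order_id INTEGER, amount REAL)")
    conn.executemany("INSERT INTO orders VALUES (?, ?)", rows)
    conn.commit()
    return len(rows)


conn = sqlite3.connect(":memory:")  # stand-in for the warehouse
loaded = load_object_into_orders(conn, bucket, "raw/orders/2024-01.csv")
total = conn.execute("SELECT SUM(amount) FROM orders").fetchone()[0]
print(loaded, total)
```

The raw object stays in the bucket unchanged. Only a structured copy lands in the warehouse, which keeps the storage role and the analytics role separate even when a single vendor offers both.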
How Much Maintenance Does a Data Lake Need?
Similar to any other infrastructure initiative in an organization, a Data Lake is an ongoing concern as an asset. Measuring direct business value and ROI from the infrastructure setup and the associated inputs and outputs of the Data Lake is a subjective exercise. Maintaining the Data Lake requires individuals familiar with data management who understand the vendor used to support the object storage, along with any associated networking and security controls, in order to provide the access that makes a Data Lake valuable to the individuals in the organization who can potentially turn the data into information, and thus business value.
How Long Does It Take to Build a Data Lake?
Similar to building a Data Warehouse, establishing a new CRM system, or any other project for the organization, building a Data Lake takes time, and the amount of time depends on the business case requirements. Like most technical projects, the solution can scale and grow. By starting with a small portion of a larger use case, the foundational elements of a Data Lake can be architected. The initial business use case can then produce a Minimum Viable Product (MVP) to prove out the concept in a pilot-like implementation. Once the result is proven to meet some basic requirements, the scale of the Data Lake can grow almost infinitely.
The largest gap between how long one expects it to take to build a Data Lake and the reality comes from the effort of defining a good business case. Because many business cases are similar regardless of industry, the DataLakeHouse project seeks to equip all organizations with proven business cases and pre-built Data Lake to Business Value pipelines to accelerate the education and implementation of these Big Data solutions.