TL;DR: Extracting, Transforming, and Loading (ETL) or even better, especially in cloud-bases systems Extract, Load, and Transforming (ELT) data from a source to a target system requires technology. Historically for simple one-off data movements or transformations, batch/bash scripts line-by-line code sufficed, but as more complex initiatives requiring larger sets of resources to work together evolved, ETL/ELT management systems were introduced which often provided developers with a component-based GUI ETL/ELT solution requiring minimal or no need for hand-written code to produce a data pipeline. More recent compute advancements and cloud innovations have brought on a renaissance of once again using code to meet not only simple ETL/ELT and data pipeline processing needs but even more complex ones such as coding ingest for a Data Lake, building a Data Warehouse, providing sources for Data Science and repeatable Machine Learning models and data sets, etc.
Refresher: What is ETL/ELT?
It is the generally recognized method in which data is moved from one or more sources into one or more destination/target systems and/or formats. The concept has been around since the 1970’s, as is probably best known and most often use in Data Warehouse development.
Extracting the data from one or more heterogeneous (i.e: different kind) or homogeneous (i.e.: same kind) sources requires identifying the source(s), the timeliness and availability of the sources, and any relationships.
Transforming the data requires understanding any business rules, business logic, and the like in order to convert, cleans, or otherwise make sense of the extracted data. This can be a recursive effort or a direct one depending on complexity, so often a staging area is used in transforming the data, thus often the first step is extracting the data to a half-way point outside of the source system(s), but not yet at the final destination.
Loading the data once the data is transformed into the target system(s) can be sometimes be deceptive. Although it is should be simply placing data into a location, one has to account for data that may already exist in the target system(s) to avoid overwriting data for example that may not be replaceable, as well as other challenges. Speed can also be a factor for loading data, as well as performance for updating or inserting records. Typically at the end of the load process the ETL developers write logging logic to notify issues or completion of tasks.
Since all ETL requires some form of technology, and multiple technology solutions exist, it is often questioned what ETL solution to use to accomplish a single concerted goal. This is especially the case amongst multiple developers who are working towards that specific purpose and must interact with one another and one another’s code in a co-development landscape.
Why have ETL/ELT as Code (ETLaC or EaC)?
Whether you call it ETL, ELT, or a Data Pipeline, the concepts accomplish basically the same goal: to move data from one place to another while cleaning the data along the way. The argument is that a recipient of the data, perhaps a business analyst or a company executive, doesn’t care about the intricacies of how you got them the data, just as long as the data is correct!
ETL as Code now comes through many different vendor and open source flavors. Most follow a concept of a Direct Acyclic Graph (DAG) programming model. The DAG coding construct aims to ensure that developers creating a pipeline that flows data only in one direction. Though it can take many paths, all directions of the code flow are forward towards the target destination end of the pipeline.
Most new frameworks are based on the Python programming language. Conceptually the idea is that programming the ETL/ELT manually gives 100% flexibility to customize the pipeline logic and how the transformations and business logic is crafted – basically no boundaries. The DAG framework are basically guard rails to keep the code moving forward to create a pipeline. The DAG Framework also allows developers to agree on a solution for the ETL/ELT which is supported, or has documentation, so that the ETL/ELT as Code framework can be learned and common principles shared so that the solution produced by the developers can be maintained.
There are obvious benefits to using ETL as Code, and if an entire team or organization can provided the Software Development Lifecycle to the process similar to how a team would for any product development and code release strategy then success can be achieved. Though some would argue that more time is spent setting up infrastructure, others would argue that the trade off is that the solution is more maintainable and follows best practices previous ETL solutions only wish they had had.
While several vendors ETL as Code / DAG concepts are programmatic scripted code oriented, several are or have expanded to including a GUI drag and drop interface which then automagically creates the code in the background which you may then have the choice of directly editing as if you had written it yourself.
What is ELT? How is ELT different from ETL?
Often called a variation of ETL, ELT (Extract, Load, and Transform) takes a slightly different approach to moving data through the pipeline. Instead of pulling data from the source system, in most cases larger volume of data source systems, incrementally then transforming the data in a separate ETL server (as some historical managed ETL systems once famed themselves), an ELT process, bulk loads the data to a staged area in or on the target system, and the transformation logic is applied in the staged area on the system, then the transformed logic is moved into the target system(s) final destination. Basically ELT is offloading as much of the transformation logic as possible out of the pipeline engine and onto the target, which usually has more processing capability, unless your target is an Excel spreadsheet(CSV). ELT was conceived for performance purposes and movement away from historical ways of thinking about ETL and infrastructure limitations.
What Solution Does ETL as Code Provide?
Some argue that ETL as Code provides more value to modern data pipeline and data integration developers than other ETL tools. Reasons for this include that only ETL as Code frameworks allow for:
- Leveraging existing programming skill, ex: Python
- Endless libraries and components vs. vendor lock-in components
- Stop and Start Where Left Off capability
- Capability to handle all (any type) of pipeline workloads
- Source Control Management
- Continuous Integration / Continuous Delivery
DataLakeHouse provides pre-built ETL as Code to support data integration of Data Warehouse and other business-value capability to extend any organizations data-value-chain.
By taking advantage of skills that most data analysts, data scientist, and other data integration specialist can quickly contribute to, it reduces the barrier of reach to engage the largest footprint of contributors and users of a solution. Combined with most cloud vendors supporting some form of open source or managed DAG pipeline capability, it is a choice which benefits some aspects of data integration more strongly than others.
ETL as Code Structure and Architecture
ETL as Code Structure and Architecture are important. The value of a high-performing, scalable ETL as Code structure is very high for mission critical applications. If a single node ETL as Code orchestration system fail in the middle of a critical nightly or intra-day pipeline process occurs, it could cause decisions to be delayed and time wasted.
Fortunately most ETL as Code systems, such as Apache Airflow for example, have the ability to start off as a single node architecture and expand fairly easily into a multi-node cluster. There are several vendor managed solutions as well that take care of the infrastructure for your development team so they can focus on the coding aspect.
Code structure will be important for purposes of code maintenance, deployment, etc. and there are some best practices to follow. Creating a good ETL process can be based on Source, Staging, and Target. For example controlling ETL tasks to identify the one to many structure between:
- Source Dependent Extracts > Tasks
- Health Check Assessment > Tasks
- Source Independent Load > Tasks
- Post Load Processing > Tasks
Those principles then need to align with where and with what connectors/plugins the ETL as Code to be written will be stored. And while that structure can have some basic foundational concepts, your mileage may vary (YMMV) but at least a baseline should give consistent practices to the development teams efforts.
ETL as Code Best Practices
While best practices should always be considered, many of the best practices for traditional ETL still apply. However in code, the best practices are both code and framework sensitive, and the nature of the target/destination also come in to play. DataLakeHouse ETL as Code seeks to create and use best practices crafted from the best developers in the world. Take a look at the source code repositories to dig in to the initial concepts by learning as you implement the code.
- What is OLAP?
- Excellent Explanation of AirFlow and DAG by Dustin Stansbury