Data warehousing is undergoing rapid transformation, and with it comes a wave of new terminology, especially around the shift from the traditional ETL process to the newer ELT. For someone new to the field, these terms and abbreviations can seem overwhelming; some may even ask, "Why does it matter if the L comes before the T?"
The answer lies in the infrastructure and the setup. Here is what the fuss is all about: the sequencing of the letters and, more importantly, why you should consider shifting from ETL to ELT.
Understanding Data Warehousing Processes
A Data Warehouse, or Enterprise Data Warehouse (EDW), is a system implemented for the purpose of reporting and data analysis. It is a central repository of integrated data from disparate sources, used for generating reports.
The popular definition from Bill Inmon is, “It is a subject oriented, integrated, time variant and non-volatile collection of data used for decision making process.”
- Subject oriented: A data warehouse can be used to analyze a particular subject area.
- Integrated: A data warehouse integrates data from one or more disparate data sources.
- Time variant: Historical data is stored in a data warehouse.
- Non-volatile: Once data enters a data warehouse, it is not changed or overwritten.
What is ETL?
ETL stands for Extract, Transform, Load: the process of moving data from source systems into the data warehouse. ETL pipelines are critical components for feeding a data warehouse, a business intelligence system or a big data platform.
The ETL processes are:
- Extraction: Reads raw data from the source systems into a staging database or storage system.
- Transformation: Cleanses and reconciles the data across source systems, enriches it with external lookup information for analysis, and converts it into the format required by the target system.
- Loading: Writes the resulting data into the target, such as a data warehouse/EDW or business intelligence (BI) tools.
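The three stages can be sketched in a few lines of Python. This is a minimal illustration, not a production pipeline: the rows, table name and columns are invented for the example, and an in-memory SQLite database stands in for the data warehouse. The key point is that transformation happens *before* the data reaches the target.

```python
import sqlite3

# --- Extract: pull raw records from a source system (illustrative rows) ---
raw_rows = [
    {"id": "1", "amount": "100.50", "region": "emea"},
    {"id": "2", "amount": "75.00",  "region": "apac"},
]

# --- Transform: cleanse and reconcile BEFORE the data reaches the target ---
transformed = [
    (int(r["id"]), float(r["amount"]), r["region"].upper())
    for r in raw_rows
]

# --- Load: write the already-clean rows into the warehouse table ---
conn = sqlite3.connect(":memory:")  # stand-in for the data warehouse
conn.execute("CREATE TABLE sales (id INTEGER, amount REAL, region TEXT)")
conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", transformed)
```

Note that the target only ever sees typed, clean rows; the cleansing work runs on whatever machine hosts the pipeline, which is exactly the intermediate system ETL depends on.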
Advantages of ETL:
- Single view interface to integrate heterogeneous data
- Ability to join data both at the source and at the integration server, with the option to apply any business rule from within a single interface.
- Common data infrastructure for working on data movement and data quality.
- Parallel Processing Engine for providing exceptional performance and scalability.
Shortcomings of ETL:
- Migration from the server edition to the enterprise edition can require significant time and resources due to the many architectural differences between the two editions.
- No automated error handling or recovery mechanism.
- Expensive as a solution for small or midsized companies.
What is ELT?
Until recently, it was normal to stage data in an intermediate system before pushing it into the target system, because the target was optimized for retrieval and reporting, not for heavy number crunching. This is why many preferred the ETL process, in which the intermediate system is optimized to perform calculations and data transformation (hence the name of the transform stage). This approach kept the target reporting system independent of how the transform stage was implemented, but it meant organizations ran three separate systems, one for each stage.
Since hardware systems today are better equipped and capable of doing a lot more, reporting and calculations can be performed using the same system. This is where the ELT implementation comes in.
ELT stands for Extract, Load, Transform. It is an alternative to ETL and a natural fit for the data lake pattern. In ELT models, data is loaded into the data lake with minimal processing on entry, resulting in faster load times. In most cases, the transformation technology ties closely into the platform used for reporting, giving ELT the advantage of better hardware and software synchronization.
- No need for a separate transformation engine, the work is done by the target system itself.
- Loading does not wait on a separate upfront transformation stage, so less time and fewer resources are spent moving data.
- ELT works with high-end data engines such as Hadoop clusters, cloud platforms or data appliances, which gives it additional performance and security.
- The processing capability of data warehousing infrastructure reduces time that data spends in transit and makes the system more cost effective.
- The specifics of ELT development vary by platform. Hadoop clusters, for example, work by breaking a problem into smaller chunks and distributing those chunks across a large number of machines for processing; some problems split easily, others are much harder.
- Developers need to be aware of the nature of the system they’re using to perform transformations. While some systems can handle nearly any transformation, others do not have enough resources, requiring careful planning and design.
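The points above can be sketched by reversing the order of the ETL example: land the raw data first, then let the target system itself do the transformation. Again this is an illustrative sketch, with invented rows and an in-memory SQLite database standing in for a high-end target engine; in a real ELT system the SQL below would run on the warehouse or Hadoop/cloud platform itself.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for the target data engine

# --- Extract + Load: land the raw data as-is, no upfront cleansing ---
conn.execute("CREATE TABLE raw_sales (id TEXT, amount TEXT, region TEXT)")
conn.executemany(
    "INSERT INTO raw_sales VALUES (?, ?, ?)",
    [("1", "100.50", "emea"), ("2", "75.00", "apac")],
)

# --- Transform: the target system does the work itself, in SQL ---
conn.execute("""
    CREATE TABLE sales AS
    SELECT CAST(id AS INTEGER) AS id,
           CAST(amount AS REAL) AS amount,
           UPPER(region)       AS region
    FROM raw_sales
""")
```

Because both the raw and the transformed tables live in the same engine, there is no separate transformation server to build, and lineage from `sales` back to `raw_sales` is a simple query away.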
Comparison: ETL vs ELT
Although ETL and ELT differ greatly in architecture and implementation, the main difference lies in rethinking how data is moved into reporting systems. ELT takes full advantage of modern technology and, along the way, enhances the reporting solution with added value such as the ability to trace data points.
Another main attraction of ELT is the reduction in load time and the time that data is in transit, making it not just efficient but even cost-effective. Even though ELT requires a high-end system, it drastically reduces the number of components required. 
Thus, despite an ELT implementation being more complex than the one-way transaction-system-to-reporting flow of ETL, ELT is now increasingly preferred. Designing a proper ELT system takes some work, but the payoff is well worth it!
In banking terms, only data of value ends up in the data warehouse under an ETL process. What this means is that you Extract the needed data into a staging area (in relational terms, often staging tables or so-called global temporary tables), segregate it from unwanted data, perform data manipulation (Transformation) and finally Load it into the target tables in the data warehouse. Analysts then use appropriate BI tools to look at macroscopic trends in the data. This makes processes such as data matching and lineage slower and more cumbersome.
Read more about HEXANIKA’s DRAAS solution at: https://hexanika.com/company-profile/
This is where ELT works best, for it is not confined to data deemed to be of specific value. Hadoop (HDFS) systems can store everything from structured data (transactional databases) to unstructured data (coming from Excel sheets, emails, logs, the internet and other sources). Because raw data and transformed data are saved on the same system, data linkage and lineage processes are much faster and more accurate. This also drastically reduces the Total Cost of Ownership (TCO), an attractive proposition for financial institutions using Big Data storage systems.
In summary, ELT allows you to extract and load all data as is into HDFS, and then you can do Transformation through Schema on Read, thereby simplifying the process of Data Warehousing.
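"Schema on read" means the structure is applied when the data is queried, not when it is stored. A minimal Python sketch of the idea (the records and field names are invented for illustration; real systems would use tools like Hive or Spark over HDFS):

```python
import json

# Raw records landed "as is" in the lake -- no schema enforced at write time
raw_lines = [
    '{"id": "1", "amount": "100.50", "region": "emea"}',
    '{"id": "2", "amount": "75.00"}',  # region missing; stored anyway
]

def read_with_schema(line):
    """Apply the schema only when the data is read (schema on read)."""
    rec = json.loads(line)
    return {
        "id": int(rec["id"]),
        "amount": float(rec["amount"]),
        "region": rec.get("region", "UNKNOWN").upper(),
    }

records = [read_with_schema(line) for line in raw_lines]
```

Nothing was rejected or reshaped at load time; the second record, which would have failed a strict schema-on-write check, was still stored and is only normalized when read.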
HEXANIKA: Efficient, Simple and Smart!
Hexanika is a FinTech Big Data software company that has developed a software platform called SmartJoin™ for financial institutions, addressing data sourcing and reporting challenges for regulatory compliance. SmartJoin™ improves data quality, while the automated nature of SmartReg™ keeps regulatory reporting in harmony with dynamic regulatory requirements and keeps pace with new developments and the latest regulatory updates.
Hexanika leverages the power of ELT using distributed parallel processing, Big Data/Hadoop technology with a secure data cloud (IBM Cloud). Understanding the high implementation costs of new systems and the complexities involved in redesigning existing solutions, Hexanika offers a unique build that adapts to existing architectures. This makes our solution cost-effective, efficient, simple and smart!
Read more about our solution and architecture at: https://hexanika.com/big-data-solution-architecture/
Contributor: Vedvrat Shikarpur
ELT/ETL Image Credits: IBM