Using ETL as an Enabler for Data Warehouses, Data Hubs, and Data Lakes

by Punit Pathak

If you are familiar with databases, data warehouses, data hubs, or data lakes, then you have experienced the need for ETL (extract, transform, load) in your overall data flow process. While there are a number of solutions available, my intent is not to cover individual tools in this post, but to focus on the areas that need to be considered while performing all stages of ETL processing, whether you are developing an automated ETL flow or doing things more manually. That said, if you are looking to build out a Cloud Data Warehouse with a solution such as Snowflake, have data flowing into a Big Data platform such as Apache Impala or Apache Hive, or are using more traditional database or data warehousing technologies, here are a few links to analysis of the latest ETL tools that you can review (Oct 2018 Review and Aug 2018 Analysis). Keep in mind that if you are leveraging Azure (Data Factory), AWS (Glue), or Google Cloud (Dataprep), each cloud vendor has ETL tools available as well. Finally, solutions such as Databricks (Spark), Confluent (Kafka), and Apache NiFi provide varying levels of ETL functionality depending on requirements.

An Introduction to ETL

ETL is a type of data integration process referring to three distinct but interrelated steps (Extract, Transform, and Load), used to synthesize data from multiple sources, often to build a Data Warehouse, Data Hub, or Data Lake. The most common mistake made when designing and building an ETL solution is jumping into buying new tools and writing code before having a comprehensive understanding of business requirements and needs. There are some fundamental things that should be kept in mind before moving forward with implementing an ETL solution and flow.

Why Do We Need ETL?

It is essential to properly format and prepare data in order to load it into the data storage system of your choice. The three steps of ETL provide crucial functions that are often combined into a single application or suite of tools, helping in the following areas:
A basic ETL process can be categorized into the stages below: data extraction, data cleansing, data transformation, and data loading.
A viable approach should not only match your organization's needs and business requirements but also perform well across all of the above stages.

Traversing the Four Stages of ETL — Pointers to Keep in Mind
The steps above look simple, but looks can be deceiving. Let's now review each step that is required for designing and executing ETL processing and data flows.

Data Extraction and Data Cleaning

Data Source

It is very important to understand the business requirements for ETL processing. The source is the very first stage at which you interact with the available data that needs to be extracted. Organizations evaluate data through business intelligence tools, which can leverage a diverse range of data types and sources. The most common of these data types are:
First, analyze how the source data is produced and in what format it needs to be stored. Traditional data sources for BI applications include Oracle, SQL Server, MySQL, DB2, SAP HANA, etc. Evaluate any transactional databases (ERP, HR, CRM, etc.) closely, as they store an organization's daily transactions and can be limiting for BI for two key reasons:
Usage and Latency

Another consideration is how the data is going to be loaded and how it will be consumed at the destination. Say the data is going to be used by the BI team for reporting purposes; you would certainly want to know how frequently they need the data. If the frequency of retrieving the data is very high but the volume is low, then a traditional RDBMS might suffice for storing your data, as it will be cost effective. If the frequency of retrieving the data is high and the volume is also high, then a traditional RDBMS could in fact be a bottleneck for your BI team. That type of situation can be well served by a more fit-for-purpose data warehouse such as Snowflake, or by Big Data platforms that leverage Hive, Druid, Impala, HBase, etc. very efficiently. There are many other considerations as well, including the tools currently available in house, SQL compatibility (especially related to end-user tools), management overhead, and support for a wide variety of data, among other things.

Auditing Your Source Data

Data auditing refers to assessing the data quality and utility for a specific purpose. It also means looking at key metrics, other than quantity, to draw conclusions about the properties of the data set. In short, a data audit depends on a registry, which is a storage space for data assets. So, ensure that your data source is analyzed according to your organization's different fields, and then move forward based on the priority of those fields.

Analyzing Data Extraction

The main objective of the extraction process in ETL is to retrieve all the required data from the source with ease. Therefore, care should be taken to design the extraction process to avoid adverse effects on the source system in terms of performance, response time, and locking.

Steps to Perform Extraction
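As a concrete starting point, here is a minimal, hypothetical sketch of an extraction step that reads a source table in chunks and lands the rows in a staging area, keeping each query short so the source system is not locked or burdened for long periods. The connection URLs, table names, and chunk size are illustrative assumptions, not a prescribed implementation.

```python
# Minimal sketch: stream a source table into a staging area in chunks so that
# no single query holds locks or memory for long. Connection URLs, table names,
# and the chunk size are illustrative assumptions.
import pandas as pd
from sqlalchemy import create_engine

source = create_engine("postgresql://user:pwd@source-host/oltp")       # hypothetical source system
staging = create_engine("postgresql://user:pwd@staging-host/staging")  # hypothetical staging database

def extract_to_staging(table: str, chunk_size: int = 50_000) -> None:
    """Copy a source table into a stg_* table, chunk by chunk."""
    chunks = pd.read_sql(f"SELECT * FROM {table}", source, chunksize=chunk_size)
    for i, chunk in enumerate(chunks):
        chunk.to_sql(f"stg_{table}", staging,
                     if_exists="replace" if i == 0 else "append",
                     index=False)

extract_to_staging("orders")
```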
There are times when a system may not be able to provide details of the modified records, in which case full extraction is the only way to extract the data. Keep in mind that a full extract requires keeping a copy of the last extracted data in the same format in order to identify the changes. Whether you use a full or incremental extract, the extraction frequency is critical to keep in mind.

Challenges Faced During the Extraction Process

One of the challenges that we typically face early on with many customers is extracting data from unstructured data sources, e.g. text, emails, and web pages; in some cases custom apps are required, depending on the ETL tool your organization has selected. This can and will increase the overhead cost of maintaining the ETL process. Second, implementing a CDC (Change Data Capture) strategy is a challenge, as it has the potential to disrupt transaction processing during extraction. Many times the extraction schedule is an incremental extract, followed by daily, weekly, and monthly runs to bring the warehouse in sync with the source. Extracting data from a transactional database carries significant overhead, as the transactional database is designed for efficient inserts and updates rather than reads and large queries. And last, don't dismiss or forget about the "small things" referenced below while extracting the data from the source.
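To make the full-versus-incremental distinction concrete, here is a minimal sketch. It assumes a hypothetical orders table with a reliable updated_at column and a watermark (the timestamp of the previous run) persisted somewhere between runs; none of these names come from a specific tool.

```python
# Minimal sketch of an incremental (timestamp-watermark) extract, with a full
# extract as the fallback when the source cannot expose changed records.
# Table/column names (orders, updated_at) and the watermark handling are assumptions.
from datetime import datetime, timezone
import pandas as pd
from sqlalchemy import create_engine, text

source = create_engine("postgresql://user:pwd@source-host/oltp")  # hypothetical source

def incremental_extract(table: str, last_watermark: datetime) -> pd.DataFrame:
    # Only rows modified since the previous run; relies on a trustworthy updated_at column.
    query = text(f"SELECT * FROM {table} WHERE updated_at > :wm")
    return pd.read_sql(query, source, params={"wm": last_watermark})

def full_extract(table: str) -> pd.DataFrame:
    # Fallback when change tracking is unavailable; the previous full copy must be
    # kept so that changes can be identified downstream by comparison.
    return pd.read_sql(f"SELECT * FROM {table}", source)

new_rows = incremental_extract("orders", datetime(2018, 10, 1, tzinfo=timezone.utc))
next_watermark = datetime.now(timezone.utc)  # persist this for the next run
```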
All of the above challenges compound with the number of data sources, each with its own frequency of changes.

Data Cleansing Requirements

Data cleaning, cleansing, and scrubbing approaches deal with the detection and separation of invalid, duplicate, or inconsistent data to improve the quality and utility of the extracted data before it is transferred to a target database or Data Warehouse. With the significant increase in data volumes and data variety across all channels and sources, the data cleansing process plays an increasingly vital role in ETL, ensuring that clean, accurate data is used in downstream decision making and data analysis. A solid data cleansing approach should satisfy a number of requirements:
A workflow process must be created to execute all data cleansing and transformation steps for multiple sources and large data sets in a reliable and efficient way.

Data Cleansing Problems

Data quality problems that can be addressed by data cleansing originate as single-source or multi-source challenges, as listed below:

Potential Problems with Data from a Single Source
Potential Problems with Data from Multiple Sources
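One of the most common multi-source problems is the same real-world entity appearing in several systems with conflicting attribute values. The following is a minimal sketch of detecting and resolving such overlaps, assuming pandas, hypothetical CRM and ERP extracts keyed on email, and a simple "most recently updated record wins" survivorship rule.

```python
# Minimal sketch of a classic multi-source problem: the same customer exists in
# two systems with conflicting attributes. Column names and the survivorship
# rule (trust the most recently updated record) are illustrative assumptions.
import pandas as pd

crm = pd.DataFrame({
    "email": ["a@x.com", "b@x.com"],
    "city":  ["Austin", "Denver"],
    "updated_at": pd.to_datetime(["2018-09-01", "2018-07-15"]),
})
erp = pd.DataFrame({
    "email": ["a@x.com", "c@x.com"],
    "city":  ["Dallas", "Boise"],
    "updated_at": pd.to_datetime(["2018-10-01", "2018-08-20"]),
})

combined = pd.concat([crm.assign(source="crm"), erp.assign(source="erp")])
# Survivorship: keep the most recently updated record per natural key (email).
deduped = (combined.sort_values("updated_at")
                   .drop_duplicates(subset="email", keep="last"))
print(deduped)
```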
Data Cleansing Approach

While there are a number of suitable approaches for data cleansing, in general, the phases below will apply:

Data Analysis

In order to know the types of errors and inconsistent data that need to be addressed, the data must be analyzed in detail. Metadata can be analyzed to provide insight into the data properties and help detect data quality problems. There are two related approaches to data analysis.

Data Profiling

As data gets bigger and infrastructure moves to the cloud, data profiling is increasingly important. Data profiling (also called data assessment, data discovery, or data quality analysis) is a process through which data from an existing data source is examined in order to collect statistics and information about it. In this step, a systematic up-front analysis of the content of the data sources is required. Data profiling requires that a wide variety of factors be understood, including the scope of the data; the variation of data patterns and formats in the database; multiple codings, redundant values, duplicates, null values, missing values, and other anomalies that appear in the data source; the relationships between primary and foreign keys and how those relationships influence data extraction; and the applicable business rules. (A minimal profiling sketch appears at the end of this section.)

Data Mining

Data mining (also called data discovery or knowledge discovery in databases, KDD) refers to the process of analyzing data from many dimensions and perspectives and then summarizing it into useful information. It also refers to the nontrivial extraction of implicit, previously unknown, and potentially useful information from data in databases. In practice, data mining is a part of knowledge discovery, although data mining and knowledge discovery are often treated as synonyms. Through a defined approach and algorithms, investigation and analysis can occur on both current and historical data to predict future trends, enabling organizations to make proactive, knowledge-driven decisions.

Defining Transformations and Mapping Rules

Many transformation and cleaning steps need to be executed, depending upon the number of data sources, the degree of heterogeneity, and the errors in the data. Sometimes, a schema translation is used to map a source to a common data model for a Data Warehouse, where typically a relational representation is used. First, data cleaning steps can be used to correct single-source instance problems and prepare the data for integration. Later in the process, schema/data integration is performed and multi-source instance problems, e.g. duplicates, data mismatches, and nulls, are dealt with. A declarative query and mapping language should be used to specify schema-related data transformations and the cleaning process, enabling automatic generation of the transformation code.

Verification

The transformation workflow and transformation definitions should be tested and evaluated for correctness and effectiveness. It may be necessary to improve the sample or source data, or to improve the definitions. Multiple iterations of the analysis, verification, and design steps are needed, because some errors only become apparent after applying a particular transformation.

Transformation

Execution of the transformation steps is required, either by running the ETL workflow for loading and refreshing the data in a data warehouse, or while answering queries against multiple sources.
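Returning to the data profiling step above, here is a minimal sketch of the kind of up-front statistics it produces: per-column null counts, distinct counts, and min/max values, plus a simple duplicate-key check. The example DataFrame and column names are illustrative assumptions standing in for any extracted source table.

```python
# Minimal data profiling sketch: basic per-column statistics and a duplicate check.
import pandas as pd

def profile(df: pd.DataFrame) -> pd.DataFrame:
    """Return simple per-column profiling statistics for an extracted table."""
    return pd.DataFrame({
        "dtype":    df.dtypes.astype(str),
        "nulls":    df.isna().sum(),
        "distinct": df.nunique(),
        "min":      df.min(numeric_only=True),
        "max":      df.max(numeric_only=True),
    })

orders = pd.DataFrame({"order_id": [1, 2, 2, 4],
                       "amount":   [10.0, None, 5.5, 7.25]})
print(profile(orders))
print("duplicate keys:", orders["order_id"].duplicated().sum())
```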
Backflow of Cleaned Data

After errors are removed, the cleaned data should also be used to replace the dirty data on the source side in order to improve the data quality of the source database. This avoids having to redo the cleansing work in future data extractions.

Data Transformation

Once data cleansing is complete, the data needs to be moved to a target system or to an intermediate system for further processing. The transformation step in ETL helps to create a structured data warehouse. Transformation refers to the data cleansing and aggregation that prepare the data for analysis. There are two approaches to data transformation in the ETL process: the data can be transformed in a staging area before it is loaded into the target (multistage transformation), or it can be loaded into the target first and transformed there (in-warehouse transformation).
Below, aspects of both basic and advanced transformations are reviewed.

Basic Transformation
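Basic transformations typically cover cleansing, format standardization, type casting, removal of unusable records, and deduplication. Here is a minimal sketch, assuming pandas and hypothetical column names and cleaning rules.

```python
# Minimal sketch of basic transformations on extracted data: text standardization,
# date parsing, type casting, bad-record removal, and deduplication.
# Column names and cleaning rules are illustrative assumptions.
import pandas as pd

raw = pd.DataFrame({
    "customer":   ["  Acme Corp ", "acme corp", "Globex", "Globex"],
    "order_date": ["2018-10-01", "2018-10-01", None, "2018-10-03"],
    "amount":     ["100.50", "100.50", "75", "not a number"],
})

clean = raw.copy()
clean["customer"]   = clean["customer"].str.strip().str.title()             # standardize text
clean["order_date"] = pd.to_datetime(clean["order_date"], errors="coerce")  # parse dates, bad -> NaT
clean["amount"]     = pd.to_numeric(clean["amount"], errors="coerce")       # cast, bad -> NaN
clean = clean.dropna(subset=["order_date", "amount"])                       # drop unusable rows
clean = clean.drop_duplicates()                                             # remove exact duplicates
print(clean)
```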
Advanced Transformation
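Advanced transformations generally combine and enrich data rather than just clean it: lookups or joins against reference/dimension data, derived columns, and aggregations that feed summary tables. A minimal sketch, again assuming pandas and hypothetical tables and columns:

```python
# Minimal sketch of more advanced transformations: a lookup (join) against a
# dimension, a derived column, and an aggregation that could feed a summary
# fact table. Table and column names are illustrative assumptions.
import pandas as pd

orders = pd.DataFrame({
    "order_id":    [1, 2, 3],
    "customer_id": [10, 10, 20],
    "quantity":    [2, 1, 5],
    "unit_price":  [9.99, 24.50, 3.00],
})
dim_customer = pd.DataFrame({"customer_id": [10, 20], "region": ["West", "East"]})

enriched = orders.merge(dim_customer, on="customer_id", how="left")       # lookup / join
enriched["revenue"] = enriched["quantity"] * enriched["unit_price"]       # derived column
by_region = enriched.groupby("region", as_index=False)["revenue"].sum()   # aggregation
print(by_region)
```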
Data Loading

In this phase, extracted and transformed data is loaded into the end target, which may be a simple delimited flat file or a Data Warehouse, depending on the requirements of the organization. There are two types of tables in a Data Warehouse: fact tables and dimension tables. Once the data is loaded into fact and dimension tables, it's time to improve performance for BI data by creating aggregates.

Aggregation

In order to design an effective aggregate, some basic requirements should be met. First, aggregates should be stored in their own fact table. Next, all related dimensions should be compacted versions of the dimensions associated with the base-level data. Finally, group the base fact tables into one family and steer SQL toward the aggregates wherever possible. Aggregation helps to improve performance and speed up query time for analytics related to business decisions.

An Effective Loading Process

Referential Integrity Constraints

Referential integrity constraints check whether a value in a foreign key column is present in the parent table from which the foreign key is derived. This constraint is applied when new rows are inserted or when the foreign key column is updated. While inserting or loading a large amount of data, this constraint can pose a performance bottleneck. Hence, it is often advisable to disable the foreign key constraints on tables dealing with large amounts of data, especially fact tables, during the load. Make sure that the intent of referential integrity is still maintained by the ETL process that is being used; a sketch of this pattern is shown below.

Other Considerations for Effective Loading
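One such consideration, continuing the referential-integrity point above, is temporarily dropping a foreign key constraint around a bulk fact-table load and restoring it afterwards so the constraint is re-validated. This is a minimal sketch assuming SQLAlchemy, PostgreSQL-style DDL, and hypothetical table and constraint names; the exact statements vary by database.

```python
# Minimal sketch: disable (drop) a foreign key constraint, bulk-load the fact
# table from staging, then re-create the constraint so integrity is re-checked.
# Connection URL, table names, and constraint names are illustrative assumptions,
# and the DDL syntax differs between databases.
from sqlalchemy import create_engine, text

warehouse = create_engine("postgresql://user:pwd@dw-host/warehouse")  # hypothetical target

with warehouse.begin() as conn:
    conn.execute(text("ALTER TABLE fact_sales DROP CONSTRAINT IF EXISTS fk_sales_customer"))
    conn.execute(text("""
        INSERT INTO fact_sales (customer_key, date_key, amount)
        SELECT customer_key, date_key, amount FROM stg_sales
    """))
    conn.execute(text("""
        ALTER TABLE fact_sales
        ADD CONSTRAINT fk_sales_customer
        FOREIGN KEY (customer_key) REFERENCES dim_customer (customer_key)
    """))
```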
Below are the most common challenges with incremental loads.

Challenges with Incremental Loads
A final note: there are three modes of data loading, APPEND, INSERT, and REPLACE, and precautions must be taken when loading data in different modes, as the wrong mode can cause data loss.

Final Thoughts

I hope this article has given you a fresh perspective on ETL while enabling you to understand it better and use it more effectively going forward. It would be great to hear from you about your favorite ETL tools and the solutions that you are seeing take center stage for Data Warehousing. Feel free to share on other channels, and be sure to keep up with all new content from Hashmap here.

Punit Kumar Pathak is a Jr. Big Data Developer at Hashmap, working across industries (and clouds) on a number of projects involving ETL pipelining as well as log analytics flow design and implementation. He works with a group of innovative technologists and domain experts accelerating high-value business outcomes for customers, partners, and the community.