Questions tagged [etl]

ETL is an acronym for Extract, Transform, and Load. It refers to a process of extracting data from source systems, transforming the data in some way (manipulating it, filtering it, combining it with other sources), and finally loading the transformed data to target system(s).

ETL is a process prominent especially in data warehousing. The term is an industry standard, representing data movement and transformation. It performs the following functions:

Extract data from homogeneous or heterogeneous data sources

Transform data - cleansing, reformatting, standardization, aggregate, join with other data, and apply business logic

Load the data in specified target tables, systems, or flat files.

The ETL processes are reusable components that can be scheduled to perform data movement on particular time intervals. Used as batch processes, ETL jobs support massive parallel processing of data. Generally they are easy to maintain and scalable.

It is used for movement across systems when a high volume of data and complex business rules are involved.

Big Data and the ELT approach (yes, ELT)

Traditional data integration was performed using batch processing (data on the rest), while big data integration can be done in real-time or with batch processing. Which make the ETL phases reordered to become ELT in some cases, so the data is extracted, loaded into distributed file systems, and then transformed before being used.

Commonly used ETL tools are, for example:

  • SQL Server Integration Services (SSIS)
  • Informatica PowerCenter
  • IBM Infosphere DataStage
  • Syncsort DMX / DMX-h
  • Oracle Data Integrator
  • Oracle Warehouse Builder
  • SAS Data Management
  • CloverETL

Open-source solutions are available as well, such as

  • Pentaho Data Integration (Kettle)
  • Talend Open Studio.

References:

5837 questions
2
votes
3 answers

SQL to get only the mismatches columns

I wanted to know if we can get only the mismatched dataset from a table while doing volume testing please look at the below example: If the below table is my target table which I perform my testing if the data is correctly loaded, I would write a…
2
votes
1 answer

Recommended ways to load large csv to RDB like mysql

Aim : Build a small ETL framework to take a Huge CSV and dump it into RDB(say MySQL). The current approach we are thinking about is to load csv using spark into a dataframe and persist it and later use frameworks like apache scoop and and load it…
Abhishek
  • 432
  • 5
  • 19
2
votes
0 answers

Left join Lists with Linq and detect multiple matches

My goal is to compare to lists and update values on list A with values from list B. Along with that, I want it to work like a left join and keep all values from list A even if they didn't get updated and detect multiple matches. What I have tried so…
Matt Rowland
  • 4,575
  • 4
  • 25
  • 34
2
votes
2 answers

Execution time of ETL packages - script

I use below script to calcute execution time of ETL packages: DECLARE @DATE DATE = GETDATE() - 7 SELECT [folder_name] ,[project_name] ,[package_name] ,CAST([start_time] AS datetime) AS [start_time] ,DATEDIFF(minute,…
Kulis
  • 988
  • 3
  • 11
  • 25
2
votes
0 answers

MongoDB ETL into HDFS

I want to be able to move my MongoDB production data to HDFS through a continuous ETL pipeline so as to be able to run Spark/MR jobs on it. I know that Hadoop MongoDB Connector exists to read/write data from/to MongoDB, but I don't want to incur…
Vijay Kansal
  • 809
  • 12
  • 26
2
votes
0 answers

Update Vertex field values using OrientDB ETL

I am trying to upload a simple csv into an Orient DB instance. My requirement is to update the existing vertex if the "Name" attribute is already present, else create a new record. Even though the ETL executes without any exception, the IsDown…
2
votes
1 answer

What is Master Data?

This is a terminology question in the context of business intelligence, ETL data integration, and data warehousing. I've always learned that the data warehouses have transactional "fact tables" which reference "dimension tables". However, some other…
user45867
  • 887
  • 2
  • 17
  • 30
2
votes
1 answer

How to design audit dimension in the data mart using SQL Server 2012

Goal I aim to create SSIS (ETL) Template that enables audit functionality (Audit Dimension). I've discovered a few ways to implement audit dimension that are described below with some reference links below: SEQUENCE Primary Key Best way to get…
BI Dude
  • 1,842
  • 5
  • 37
  • 67
2
votes
3 answers

SQL Server Integration Services - Incremental data load hash comparison

Using SQL Server Integration Services (SSIS) to perform incremental data load, comparing a hash of to-be-imported and existing row data. I am using this: http://ssismhash.codeplex.com/ to create the SHA512 hash for comparison. When trying to compare…
Bryan
  • 21
  • 1
  • 2
2
votes
2 answers

Incremental update of Table

I am copy data from SQL Server to Oracle with a SSIS package, and I want to improve performance of copying with incremental updating. My idea is to : Append new records of table.(delete them before then append) Delete "Deleted" records Code: drop…
Vitaly Ascheulov
  • 182
  • 2
  • 15
2
votes
2 answers

Handling ETL Failures in Talend

I have several Servers and I have downloaded all files from all servers. sometimes, one of servers doesn't work properly and the ETL is unable to download files from the faulty server. So, ETL stops and it must be run manually again. I would like to…
Hassan
  • 43
  • 3
  • 6
2
votes
1 answer

SSIS - How to execute next Task after For Loop

Im trying to execute a SQL Task after a For Loop Container completes (i.e performs its iterations) but it just wont execute. I cant for the life of me work out why it wont execute as there are no errors. Any ideas? The Loop executes the correct…
Matt
  • 389
  • 1
  • 8
  • 15
2
votes
1 answer

Parse .ETL file in Azure Web Job

I have a bunch of .ETL crash files stored in Azure blob storage that I want to parse and store in a SQL table. So far the only way I've found to parse these .ETL files is to pass them into tracerpt and then parse the XML file outputted from…
joe_coolish
  • 7,201
  • 13
  • 64
  • 111
2
votes
2 answers

How to create empty .txt file using ssis without scripting

My first task is File system task. This task creates a folder. Now I'm not able to figure out how to create a blank text file in the folder just created by File system task. I have created 2 variables - mypath (c:\sample\files) and myfilename…
sam
  • 1,242
  • 3
  • 12
  • 31
2
votes
1 answer

Pentaho Data Integration transformation, loading fields from csv file (Java API)

I am trying to create simple transformation using Kettle Java API. Just two blocks, one to read from csv file and the other to write it into text file. Code: PluginRegistry.addPluginType(SpoonPluginType.getInstance()); …
Purple
  • 711
  • 2
  • 10
  • 19
1 2 3
99
100