Questions tagged [etl]

ETL is an acronym for Extract, Transform, and Load. It refers to a process of extracting data from source systems, transforming the data in some way (manipulating it, filtering it, combining it with other sources), and finally loading the transformed data to target system(s).

ETL is a process prominent especially in data warehousing. The term is an industry standard, representing data movement and transformation. It performs the following functions:

Extract data from homogeneous or heterogeneous data sources

Transform data - cleansing, reformatting, standardization, aggregate, join with other data, and apply business logic

Load the data in specified target tables, systems, or flat files.

The ETL processes are reusable components that can be scheduled to perform data movement on particular time intervals. Used as batch processes, ETL jobs support massive parallel processing of data. Generally they are easy to maintain and scalable.

It is used for movement across systems when a high volume of data and complex business rules are involved.

Big Data and the ELT approach (yes, ELT)

Traditional data integration was performed using batch processing (data on the rest), while big data integration can be done in real-time or with batch processing. Which make the ETL phases reordered to become ELT in some cases, so the data is extracted, loaded into distributed file systems, and then transformed before being used.

Commonly used ETL tools are, for example:

  • SQL Server Integration Services (SSIS)
  • Informatica PowerCenter
  • IBM Infosphere DataStage
  • Syncsort DMX / DMX-h
  • Oracle Data Integrator
  • Oracle Warehouse Builder
  • SAS Data Management
  • CloverETL

Open-source solutions are available as well, such as

  • Pentaho Data Integration (Kettle)
  • Talend Open Studio.

References:

5837 questions
12
votes
4 answers

How to move data from Glue to Dynamodb

We are designing an Big data solution for one of our dashboard applications and seriously considering Glue for our initial ETL. Currently Glue supports JDBC and S3 as the target but our downstream services and components will work better with…
Robby
  • 371
  • 2
  • 3
  • 15
12
votes
1 answer

Is it possible to pass parameters to a .dtsx package on the command line?

I am currently executing an SSIS package (package.dtsx) from the command line using Dtexec. This is as simple as: dtexec /f Package.dtsx However, I have some parameters that I would like to pass to the package for it to use during execution. The…
Hoppy
  • 720
  • 2
  • 12
  • 24
12
votes
1 answer

How to use NiFi ExecuteScript processor with Python?

I have a very basic setup of the ExecuteScript processor in Apache NiFi with a simple Python script (saved as a .py file) as shown here. In the Properties of the processor, I set the Script Engine to python and Script File to the path of this…
Mushu909
  • 1,194
  • 2
  • 11
  • 16
12
votes
4 answers

How does ETL (database to database) fit into SOA?

Lets imagine, that our application needs ETL (extract, transform, load) data from relation database to another relation database. Most simple (and most performance, IMHO) way is to make link between databases and write simple stored procedure. In…
nikolobok
  • 155
  • 1
  • 7
12
votes
7 answers

C# Importing Large Volume of Data from CSV to Database

What's the most efficient method to load large volumes of data from CSV (3 million + rows) to a database. The data needs to be formatted(e.g. name column needs to be split into first name and last name, etc.) I need to do this in a efficiently as…
guazz
  • 121
  • 1
  • 3
12
votes
2 answers

Replace fieldnames when using DictReader

I have a test.csv file: foo,bar,foobar,barfoo 1,2,3,4 5,6,7,8 9,10,11,12 And the following CSV parser: #!/usr/bin/env python # -*- coding: utf-8 -*- import csv import json f = open ( 'test.csv', 'r' ) reader = csv.DictReader( f, fieldnames = (…
cherrun
  • 2,102
  • 8
  • 34
  • 51
11
votes
4 answers

AWS Glue and update duplicating data

I'm using AWS Glue to move multiple files to an RDS instance from S3. Each day I get a new file into S3 which may contain new data, but can also contain a record I have already saved with some updates values. If I run the job multiple times I will…
joshuahornby10
  • 4,222
  • 8
  • 36
  • 52
11
votes
4 answers

What is the definition of realtime, near realtime and batch? Give examples of each?

I'm trying to get a good definition of realtime, near realtime and batch? I am not talking about sync and async although to me, they are different dimensions. Here is what I'm thinking Realtime is sync web services or async web services. Near…
Albert T. Wong
  • 1,535
  • 1
  • 13
  • 21
11
votes
5 answers

SSIS File System Task Error while copying files between servers

I can copy files between two servers say Server A and Server B manually and I have permissions to folders on either side. I am using File System Task to Copy files. When my Source and Destination are within the Server the Package works fine in…
Manoj Nayak
  • 2,449
  • 9
  • 31
  • 53
11
votes
3 answers

Print a postgresql table to standard output in python

I have a table in postgresql named mytable and I need to print the contents of this table from a python application to stdout. I'm currently doing the following: conn = psycopg2.connect("dbname=postgres user=postgres password=psswd") …
blabla
  • 303
  • 1
  • 4
  • 13
11
votes
5 answers

Straight Java/Groovy versus ETL tool (Talend/etc) - what libraries would you use?

Assume you have a small project which on the surface looks like a good match for an ETL tool like Talend. But assume further, that you have never used Talend and furthermore, you do not trust "visual programming" tools in general and would rather…
Alex R
  • 11,364
  • 15
  • 100
  • 180
11
votes
3 answers

What are the required functionalities of ETL frameworks?

I am writing an ETL (in python with a mongodb backend) and was wondering : what kind of standard functions and tools an ETL should have to be called an ETL ? This ETL will be as general purpose as possible, with a scriptable and modular approach.…
edomaur
  • 1,397
  • 4
  • 23
  • 38
11
votes
3 answers

How to add third party libraries to Talend project?

How to add third party libraries (jar files) to a Talend project ? One more question is, Each Talend component uses LogFactory, but in my case it is throwing java.lang.NoClassDefFoundError: org.apache.commons.logging.LogFactory at ... with…
RP-
  • 5,827
  • 2
  • 27
  • 46
10
votes
5 answers

What is the best way to save XML data to SQL Server?

Is there a direct route that is pretty straight forward? (i.e. can SQL Server read XML) Or, is it best to parse the XML and just transfer it in the usual way via ADO.Net either as individual rows or perhaps a batch update? I realize there may be…
alchemical
  • 13,559
  • 23
  • 83
  • 110
10
votes
2 answers

APACHE NIFI vs APACHE AIRFLOW vs APACHE FALCON ? Which suits best in the below scenario?

I am developing a solution in Java which communicates with a set of devices through REST APIs which belongs to different vendors. So for each vendor, there are a set of processes that I have to perform inside my solution. However, these processes…
Selaka Nanayakkara
  • 3,296
  • 1
  • 22
  • 42