Questions tagged [etl]

ETL is an acronym for Extract, Transform, and Load. It refers to the process of extracting data from source systems, transforming the data in some way (manipulating it, filtering it, combining it with other sources), and finally loading the transformed data into the target system(s).

ETL is especially prominent in data warehousing. The term is an industry standard for data movement and transformation. An ETL process performs the following functions:

  • Extract data from homogeneous or heterogeneous data sources.
  • Transform data: cleanse, reformat, standardize, aggregate, join with other data, and apply business logic.
  • Load the data into the specified target tables, systems, or flat files.

ETL processes are reusable components that can be scheduled to perform data movement at particular time intervals. Run as batch processes, ETL jobs support massively parallel processing of data, and they are generally easy to maintain and scale.

ETL is typically used to move data across systems when high data volumes and complex business rules are involved.
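As a minimal sketch of the three phases (independent of any particular tool), the following Python script extracts rows from a CSV file, applies a simple transformation, and loads the result into a SQLite table. The file name, column names, and target schema are illustrative assumptions, not a reference to any specific system mentioned here.

```python
# Minimal ETL sketch: extract from a CSV file, transform in memory, load into SQLite.
# The file name, column names, and target table are illustrative assumptions.
import csv
import sqlite3

def extract(path):
    """Extract: read raw rows from the source file."""
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def transform(rows):
    """Transform: cleanse, standardize, and apply simple business logic."""
    out = []
    for r in rows:
        amount = float(r["amount"] or 0)
        if amount <= 0:  # filter out invalid or empty records
            continue
        out.append({
            "client_id": r["client_id"].strip().upper(),  # standardize the key
            "amount": round(amount, 2),
        })
    return out

def load(rows, db_path="target.db"):
    """Load: write the transformed rows into the target table."""
    con = sqlite3.connect(db_path)
    con.execute("CREATE TABLE IF NOT EXISTS sales (client_id TEXT, amount REAL)")
    con.executemany("INSERT INTO sales VALUES (:client_id, :amount)", rows)
    con.commit()
    con.close()

if __name__ == "__main__":
    load(transform(extract("source.csv")))
```

In a production pipeline the same three steps would typically be implemented in one of the ETL tools listed below and scheduled as a batch job, as described above.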

Big Data and the ELT approach (yes, ELT)

Traditional data integration is performed as batch processing on data at rest, while big data integration can be done in real time or in batches. In some cases this reorders the ETL phases into ELT: the data is extracted, loaded into distributed file systems or the target platform as-is, and only then transformed before being used.
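A minimal sketch of the ELT ordering, reusing the same illustrative SQLite target as in the sketch above: the raw rows are landed untouched in a staging table first, and the transformation then runs inside the target engine as SQL. In a big data setting the staging area would typically be a distributed file system or the warehouse itself rather than SQLite; the names here are assumptions for the example.

```python
# Minimal ELT sketch: extract and load the raw data first, then transform
# inside the target engine with SQL. Table and column names are illustrative assumptions.
import csv
import sqlite3

con = sqlite3.connect("target.db")

# Extract + Load: land the raw rows untouched in a staging table.
with open("source.csv", newline="") as f:
    rows = list(csv.DictReader(f))
con.execute("CREATE TABLE IF NOT EXISTS staging_sales (client_id TEXT, amount TEXT)")
con.executemany("INSERT INTO staging_sales VALUES (:client_id, :amount)", rows)

# Transform: runs after loading, using the target engine's own SQL.
con.execute("""
    CREATE TABLE IF NOT EXISTS sales AS
    SELECT UPPER(TRIM(client_id)) AS client_id,
           ROUND(CAST(amount AS REAL), 2) AS amount
    FROM staging_sales
    WHERE CAST(amount AS REAL) > 0
""")
con.commit()
con.close()
```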

Commonly used ETL tools include:

  • SQL Server Integration Services (SSIS)
  • Informatica PowerCenter
  • IBM Infosphere DataStage
  • Syncsort DMX / DMX-h
  • Oracle Data Integrator
  • Oracle Warehouse Builder
  • SAS Data Management
  • CloverETL

Open-source solutions are available as well, such as:

  • Pentaho Data Integration (Kettle)
  • Talend Open Studio

5837 questions

2 votes, 1 answer
How to implement the below query using ssis
Can we discuss how to load the below query result into a destination table using SSIS? I know we can use this in T-SQL and also as an OLE DB source query, but I am still wondering how to implement it using only data flow components: SELECT CLIENTID …
user1254579

2 votes, 1 answer
How to do a simple Talend Upsert with timestamp column
I'm trying to create a Talend job which will do the following: take input rows and insert or update into an Oracle table. If the row already exists in the Oracle table, check if it needs updating based on primary key matching. If the row does not…
Paul

2 votes, 2 answers
Manipulating Data Within AWS Redshift to a Schedule
Current setup: SQL Server OLTP database; AWS Redshift OLAP database updated from OLTP via SSIS every 20 minutes; our customers only have access to the OLAP DB. Requirement: one customer requires some additional tables to be created and populated to…
Simon1979

2 votes, 3 answers
Talend and Apache Spark?
I am confused as to where Talend and Apache Spark fit in the big data ecosystem, as both Apache Spark and Talend can be used for ETL. Could someone please explain this with an example?
user2803194

2 votes, 1 answer
Creating real time datawarehouse
I am doing a personal project that consists of creating the full architecture of a data warehouse (DWH). In this case, as the ETL and BI analysis tool I decided to use Pentaho; it has a lot of functionality, from allowing easy dashboard creation, to…
mrc

2 votes, 1 answer
OrientDB ETL Edge transformer 2 joinFieldName(s)
With one joinFieldName and lookup, the Edge transformer works perfectly. However, now two keys are required, i.e. a compound index in the lookup. How can two joinFieldNames be specified? This is the scripted (post-processing) version: Create edge Expands…
user5598747

2 votes, 1 answer
Load bulk JSON data into SQL Server table
What is the standard way of loading a bulk of JSON data from a file into a database table in a SQL Server version earlier than 2016? I am aware of the OPENJSON() function that was introduced in SQL Server 2016. However, I am limited to using SQL Server…
Akshay Rane

2 votes, 1 answer
Reverse engineering DataStage code into Pig (for Hadoop)
I have a landscape of DataStage applications which I want to reverse engineer into Pig, rather than having to write fresh Pig code and try to replicate the DataStage functionality. Has anyone had experience of doing something similar? Any tips on…
Steve

2 votes, 1 answer
What could be incorrect in this multidimensional model?
I want to build a simple multidimensional data model by using the star schema in a relational database (ROLAP). For that I create a fact table and two dimension tables. First I copy the data from the operational source and handle this data…
user1134181

2 votes, 0 answers
Importing data using SSIS using a Web Service (https) as destination
Trying to set up a data import from SQL Server to the cloud (destination Web Service Task), only the destination requires an SSL certificate (https). Is there a way to natively bypass this SSL security check using the Web Service Task? (i.e.…
dirtyw0lf

2 votes, 3 answers
Best way to query entire MongoDB collection for ETL
We want to query an entire live production MongoDB collection (v2.6, around 500GB of data in around 70M documents). We're wondering what the best approach for this is: a single query with no filtering to open a cursor and get documents in batches of…
Ron

2 votes, 0 answers
How do I configure OrientDB ETL to import an edge list with attributes
I have a CSV which contains an edge list, one edge per row. It looks like this: id1, id2, attr1, attr2, attrX, attrY, attrZ. From this, I want to be able to create (or update) the following, per row: vertex A of class X, with id1 and attribute attr1…
G. Hirpara

2 votes, 2 answers
Open a word document with python using windows
I am trying to open a Word document with Python on Windows, but I am unfamiliar with Windows. My code is as follows: import docx as dc doc = dc.Document(r'C:\Users\justin.white\Desktop\01100-Allergan-UD1314-SUMMARY OF WORK.docx') Through another…
Jstuff

2 votes, 1 answer
Using ETL (non-MS) to get data from Infopath forms stored in Sharepoint 2007
I'm looking at the architecture for a DW project and there will be the need for some manual collection of [structured] data, e.g. the monthly accounting results from a country manager where they need to complete a form and fill in half a dozen values…
MarkH

2 votes, 1 answer
Rebuild Informatica persistent cache only when data is updated in the table
I have a requirement to build a persistent cache in Informatica for a large table due to a performance issue. My requirement is to build this cache only when there is some change in the table (the database is Oracle). In my case the table updates data…
Pravin Satav