Questions tagged [etl]

ETL is an acronym for Extract, Transform, and Load. It refers to the process of extracting data from source systems, transforming the data in some way (manipulating it, filtering it, combining it with other sources), and finally loading the transformed data into the target system(s).

ETL is especially prominent in data warehousing. The term is an industry standard for data movement and transformation. An ETL process performs the following functions:

  • Extract data from homogeneous or heterogeneous data sources.
  • Transform data: cleanse, reformat, standardize, aggregate, join with other data, and apply business logic.
  • Load the data into the specified target tables, systems, or flat files.

ETL processes are reusable components that can be scheduled to perform data movement at particular time intervals. Run as batch processes, ETL jobs support massively parallel processing of data, and they are generally easy to maintain and scale.

ETL is typically used to move data across systems when high data volumes and complex business rules are involved.
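As a minimal sketch of the three phases (independent of any particular tool), the following Python script extracts rows from a CSV file, applies a simple transformation, and loads the result into a SQLite table. The file name, column names, and target schema are illustrative assumptions, not a reference to any specific system mentioned here.

```python
# Minimal ETL sketch: extract from a CSV file, transform in memory, load into SQLite.
# The file name, column names, and target table are illustrative assumptions.
import csv
import sqlite3

def extract(path):
    """Extract: read raw rows from the source file."""
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def transform(rows):
    """Transform: cleanse, standardize, and apply simple business logic."""
    out = []
    for r in rows:
        amount = float(r["amount"] or 0)
        if amount <= 0:  # filter out invalid or empty records
            continue
        out.append({
            "client_id": r["client_id"].strip().upper(),  # standardize the key
            "amount": round(amount, 2),
        })
    return out

def load(rows, db_path="target.db"):
    """Load: write the transformed rows into the target table."""
    con = sqlite3.connect(db_path)
    con.execute("CREATE TABLE IF NOT EXISTS sales (client_id TEXT, amount REAL)")
    con.executemany("INSERT INTO sales VALUES (:client_id, :amount)", rows)
    con.commit()
    con.close()

if __name__ == "__main__":
    load(transform(extract("source.csv")))
```

In a production pipeline the same three steps would typically be implemented in one of the ETL tools listed below and scheduled as a batch job, as described above.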

Big Data and the ELT approach (yes, ELT)

Traditional data integration is performed as batch processing on data at rest, while big data integration can be done in real time or in batches. In some cases this reorders the ETL phases into ELT: the data is extracted, loaded into distributed file systems or the target platform as-is, and only then transformed before being used.
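A minimal sketch of the ELT ordering, reusing the same illustrative SQLite target as in the sketch above: the raw rows are landed untouched in a staging table first, and the transformation then runs inside the target engine as SQL. In a big data setting the staging area would typically be a distributed file system or the warehouse itself rather than SQLite; the names here are assumptions for the example.

```python
# Minimal ELT sketch: extract and load the raw data first, then transform
# inside the target engine with SQL. Table and column names are illustrative assumptions.
import csv
import sqlite3

con = sqlite3.connect("target.db")

# Extract + Load: land the raw rows untouched in a staging table.
with open("source.csv", newline="") as f:
    rows = list(csv.DictReader(f))
con.execute("CREATE TABLE IF NOT EXISTS staging_sales (client_id TEXT, amount TEXT)")
con.executemany("INSERT INTO staging_sales VALUES (:client_id, :amount)", rows)

# Transform: runs after loading, using the target engine's own SQL.
con.execute("""
    CREATE TABLE IF NOT EXISTS sales AS
    SELECT UPPER(TRIM(client_id)) AS client_id,
           ROUND(CAST(amount AS REAL), 2) AS amount
    FROM staging_sales
    WHERE CAST(amount AS REAL) > 0
""")
con.commit()
con.close()
```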

Commonly used ETL tools include:

  • SQL Server Integration Services (SSIS)
  • Informatica PowerCenter
  • IBM Infosphere DataStage
  • Syncsort DMX / DMX-h
  • Oracle Data Integrator
  • Oracle Warehouse Builder
  • SAS Data Management
  • CloverETL

Open-source solutions are available as well, such as:

  • Pentaho Data Integration (Kettle)
  • Talend Open Studio

5837 questions

2 votes, 1 answer
How to implement the below query using ssis
Can we discuss how to load the below query result into a destination table using SSIS? I know we can use this in T-SQL and also as an OLE DB source query, but I am still wondering how to implement it using only data flow components: SELECT CLIENTID …
user1254579

2 votes, 1 answer
How to do a simple Talend Upsert with timestamp column
I'm trying to create a Talend job which will do the following: take input rows and insert or update into an Oracle table. If the row already exists in the Oracle table, check if it needs updating based on primary key matching. If the row does not…
Paul

2 votes, 2 answers
Manipulating Data Within AWS Redshift to a Schedule
Current setup: SQL Server OLTP database; AWS Redshift OLAP database updated from OLTP via SSIS every 20 minutes; our customers only have access to the OLAP DB. Requirement: one customer requires some additional tables to be created and populated to…
Simon1979

2 votes, 3 answers
Talend and Apache Spark?
I am confused as to where Talend and Apache Spark fit in the big data ecosystem, as both Apache Spark and Talend can be used for ETL. Could someone please explain this with an example?
user2803194

2 votes, 1 answer
Creating real time datawarehouse
I am doing a personal project that consists of creating the full architecture of a data warehouse (DWH). In this case, as the ETL and BI analysis tool I decided to use Pentaho; it has a lot of functionality, from allowing easy dashboard creation, to…
mrc

2 votes, 1 answer
OrientDB ETL Edge transformer 2 joinFieldName(s)
With one joinFieldName and lookup, the Edge transformer works perfectly. However, now two keys are required, i.e. a compound index in the lookup. How can two joinFieldNames be specified? This is the scripted (post-processing) version: Create edge Expands…
user5598747

2 votes, 1 answer
Load bulk JSON data into SQL Server table
What is the standard way of loading a bulk of JSON data from a file into a database table in a SQL Server version earlier than 2016? I am aware of the OPENJSON() function that was introduced in SQL Server 2016. However, I am limited to using SQL Server…
Akshay Rane

2 votes, 1 answer
Reverse engineering DataStage code into Pig (for Hadoop)
I have a landscape of DataStage applications which I want to reverse engineer into Pig, rather than having to write fresh Pig code and try to replicate the DataStage functionality. Has anyone had experience of doing something similar? Any tips on…
Steve

2 votes, 1 answer
What could be incorrect in this multidimensional model?
I want to build a simple multidimensional data model by using the star schema in a relational database (ROLAP). For that I create a fact table and two dimension tables. First I copy the data from the operational source and handle this data…
user1134181

2 votes, 0 answers
Importing data using SSIS using a Web Service (https) as destination
Trying to set up a data import from SQL Server to the cloud (destination Web Service Task), only the destination requires an SSL certificate (https). Is there a way to natively bypass this SSL security check using the Web Service Task? (i.e.…
dirtyw0lf

2 votes, 3 answers
Best way to query entire MongoDB collection for ETL
We want to query an entire live production MongoDB collection (v2.6, around 500GB of data in around 70M documents). We're wondering what the best approach for this is: a single query with no filtering to open a cursor and get documents in batches of…
Ron

2 votes, 0 answers
How do I configure OrientDB ETL to import an edge list with attributes
I have a CSV which contains an edge list, one edge per row. It looks like this: id1, id2, attr1, attr2, attrX, attrY, attrZ. From this, I want to be able to create (or update) the following, per row: vertex A of class X, with id1 and attribute attr1…
G. Hirpara

2 votes, 2 answers
Open a word document with python using windows
I am trying to open a Word document with Python on Windows, but I am unfamiliar with Windows. My code is as follows: import docx as dc doc = dc.Document(r'C:\Users\justin.white\Desktop\01100-Allergan-UD1314-SUMMARY OF WORK.docx') Through another…
Jstuff

2 votes, 1 answer
Using ETL (non-MS) to get data from Infopath forms stored in Sharepoint 2007
I'm looking at the architecture for a DW project and there will be the need for some manual collection of [structured] data, e.g. the monthly accounting results from a country manager where they need to complete a form and fill in half a dozen values…
MarkH

2 votes, 1 answer
Rebuild Informatica persistent cache only when data is updated in the table
I have a requirement to build a persistent cache in Informatica for a large table due to a performance issue. My requirement is to build this cache only when there is some change in the table (the database is Oracle). In my case the table updates data…
Pravin Satav