Questions tagged [etl]

ETL is an acronym for Extract, Transform, and Load. It refers to a process of extracting data from source systems, transforming the data in some way (manipulating it, filtering it, combining it with other sources), and finally loading the transformed data to target system(s).

ETL is a process prominent especially in data warehousing. The term is an industry standard, representing data movement and transformation. It performs the following functions:

Extract data from homogeneous or heterogeneous data sources

Transform data - cleansing, reformatting, standardization, aggregate, join with other data, and apply business logic

Load the data in specified target tables, systems, or flat files.

The ETL processes are reusable components that can be scheduled to perform data movement on particular time intervals. Used as batch processes, ETL jobs support massive parallel processing of data. Generally they are easy to maintain and scalable.

It is used for movement across systems when a high volume of data and complex business rules are involved.

Big Data and the ELT approach (yes, ELT)

Traditional data integration was performed using batch processing (data on the rest), while big data integration can be done in real-time or with batch processing. Which make the ETL phases reordered to become ELT in some cases, so the data is extracted, loaded into distributed file systems, and then transformed before being used.

Commonly used ETL tools are, for example:

  • SQL Server Integration Services (SSIS)
  • Informatica PowerCenter
  • IBM Infosphere DataStage
  • Syncsort DMX / DMX-h
  • Oracle Data Integrator
  • Oracle Warehouse Builder
  • SAS Data Management
  • CloverETL

Open-source solutions are available as well, such as

  • Pentaho Data Integration (Kettle)
  • Talend Open Studio.

References:

5837 questions
15
votes
3 answers

ETL using Python

I am working on a data warehouse and looking for an ETL solution that uses Python. I have played with SnapLogic as an ETL, but I was wondering if there were any other solutions out there. This data warehouse is just getting started. Ihave not…
emilam
  • 195
  • 2
  • 8
15
votes
5 answers

SQL Server Management Studio: Import quietly ignoring 99.9% of data

The Problem i'm trying to import data into a table using SQL Server Management Studio's Import Data task. It only brings in 26 rows, out of the original 49,325. (Edit: That's where 99.9% comes from: (1-26/49325)*100 = 99.9% Using DTS in Enterprise…
Ian Boyd
  • 246,734
  • 253
  • 869
  • 1,219
14
votes
3 answers

Automate Version number Retrieval from .Dtsx files

Is there a way or query by which I can find the version number of SSIS packages (*.dtsx files)? I have the *.dtsx files in my Team Foundation Server for which I wanted to know. The manual way is to do a mouse right-click on the package and click…
saurabh sharma
  • 183
  • 1
  • 13
14
votes
2 answers

Airflow: pattern to run airflow subdag once

From the airflow documentation: SubDAGs must have a schedule and be enabled. If the SubDAG’s schedule is set to None or @once, the SubDAG will succeed without having done anything I understand the subdagoperator is actually implemented as a…
gnicholas
  • 2,041
  • 1
  • 21
  • 32
14
votes
5 answers

How to test (unit test) on ETL process?

I know several small companies do not do testing on ETL process, but that seems to be suboptimal from the perspective of software engineering. How do people usually do testing/unit test/functional test on ETL process?
Hello lad
  • 17,344
  • 46
  • 127
  • 200
14
votes
4 answers

Which is better, ETL or ELT?

Having spent some time working on data warehousing, I have created both ETL (extract transform load) and ELT (extract load transform) processes. It seems that ELT is a newer approach to populating data warehouses that can more easily take advantage…
Chris J
  • 9,164
  • 7
  • 40
  • 39
14
votes
6 answers

concurrent statistics gathering on Oracle 11g partiitioned table

I am developing a DWH on Oracle 11g. We have some big tables (250+ million rows), partitioned by value. Each partition is a assigned to a different feeding source, and every partition is independent from others, so they can be loaded and processed…
Francesco
  • 161
  • 1
  • 10
14
votes
4 answers

PostgreSQL to Data-Warehouse: Best approach for near-real-time ETL / extraction of data

Background: I have a PostgreSQL (v8.3) database that is heavily optimized for OLTP. I need to extract data from it on a semi real-time basis (some-one is bound to ask what semi real-time means and the answer is as frequently as I reasonably can but…
14
votes
1 answer

How To Remove non-alpha numeric, or non-numeric characters with Hive REGEXP_EXTRACT() Function

I've been trying to figure out how to remove multiple non-alphanumeric or non-numeric characters, or return only the numeric characters from a string. I've tried: SELECT regexp_extract('X789', '[0-9]', 0) FROM table_name But it returns '7', not…
user1152532
  • 697
  • 3
  • 7
  • 15
14
votes
4 answers

Load contents in text files to sqlite table?

I have simple text files that contain just normal texts. I was wondering if there is a way to load the text contents to a table in sqlite. So maybe I could Create table myTable(nameOfText TEXT, contents TEXT); And then put nameOfText into the first…
Sardonic
  • 441
  • 3
  • 8
  • 19
13
votes
1 answer

How to Map Input and Output Columns dynamically in SSIS?

I Have to Upload Data in SQL Server from .dbf Files through SSIS. My Output Column is fixed but the input column is not fixed because the files come from the client and the client may have updated data by his own style. there may be some unused…
Shahab Haidar
  • 625
  • 3
  • 11
  • 25
13
votes
1 answer

SSIS vs. Oracle Data Integrator

Currently I am a Data Engineer that works mainly with SSIS. While reading about the ETL tools available in the market, i found that Oracle has its own ETL tool called ODI (Oracle Data integrator). I searched for an unbiased comparison between the…
Hadi
  • 36,233
  • 13
  • 65
  • 124
13
votes
2 answers

Where is the msdb database for standalone SSIS Servers

I am working on a DWH application. The server I was granted access to only has SSIS installed with no database engine. I am getting errors like: The EXECUTE permission was denied on the object 'sp_ssis_listfolders', database 'msdb', schema…
Moslem Ben Dhaou
  • 6,897
  • 8
  • 62
  • 93
13
votes
3 answers

Reading CSV file some missing columns

I am trying to read in a CSV file into my VB.net application using the following code: While Not EOF(1) Input(1, dummy) Input(1, phone_number) Input(1, username) Input(1, product_name) Input(1, wholesale_cost) Input(1,…
charlie
  • 415
  • 4
  • 35
  • 83
13
votes
3 answers

COPY column order

I'm trying to use COPY with HEADER option but my header line in file is in different order than the column order specified in database. Is the column name order necessary in my file ?? My code is as below: COPY table_name ( …
sneha
  • 169
  • 1
  • 4
  • 11