Questions tagged [etl]

ETL is an acronym for Extract, Transform, and Load. It refers to a process of extracting data from source systems, transforming the data in some way (manipulating it, filtering it, combining it with other sources), and finally loading the transformed data to target system(s).

ETL is a process prominent especially in data warehousing. The term is an industry standard, representing data movement and transformation. It performs the following functions:

Extract data from homogeneous or heterogeneous data sources

Transform data - cleansing, reformatting, standardization, aggregate, join with other data, and apply business logic

Load the data in specified target tables, systems, or flat files.

The ETL processes are reusable components that can be scheduled to perform data movement on particular time intervals. Used as batch processes, ETL jobs support massive parallel processing of data. Generally they are easy to maintain and scalable.

It is used for movement across systems when a high volume of data and complex business rules are involved.

Big Data and the ELT approach (yes, ELT)

Traditional data integration was performed using batch processing (data on the rest), while big data integration can be done in real-time or with batch processing. Which make the ETL phases reordered to become ELT in some cases, so the data is extracted, loaded into distributed file systems, and then transformed before being used.

Commonly used ETL tools are, for example:

  • SQL Server Integration Services (SSIS)
  • Informatica PowerCenter
  • IBM Infosphere DataStage
  • Syncsort DMX / DMX-h
  • Oracle Data Integrator
  • Oracle Warehouse Builder
  • SAS Data Management
  • CloverETL

Open-source solutions are available as well, such as

  • Pentaho Data Integration (Kettle)
  • Talend Open Studio.

References:

5837 questions
10
votes
0 answers

AWS Glue disable sslmode for target connections

Am quite new to AWS Glue; we are building an ETL process that pulls data from an external source on a MySQL database into Redshift. After adding the connections it and testing them it would connect successfully to the instance (without…
Mo J. Mughrabi
  • 6,747
  • 16
  • 85
  • 143
10
votes
2 answers

Data profiling Task - custom Profile Request

Is there any option to create a custom Profile Request for SSIS Data Profiling Task? At the moment there are 5 standard profile requests under SSIS Data Profiling task: Column Null Ratio Profile Request Column Statistics Profile Request Column…
Barsham
  • 749
  • 8
  • 30
10
votes
5 answers

Query a database based on result of query from another database

I am using SSIS in VS 2013. I need to get a list of IDs from 1 database, and with that list of IDs, I want to query another database, ie SELECT ... from MySecondDB WHERE ID IN ({list of IDs from MyFirstDB}).
faujong
  • 949
  • 4
  • 24
  • 40
10
votes
2 answers

Can Spark Replace ETL Tool

Existing process - raw structure data are copied into a staging layer of Redshift. Then use ETL tools such as Informatica, Telend to do incremental loading into Fact and Dimension table of Datamart/datawarehouse. All joins happen within database…
10
votes
2 answers

In Powershell, what's the most efficient way to split a large text file by record type?

I am using Powershell for some ETL work, reading compressed text files in and splitting them out depending on the first three characters of each line. If I were just filtering the input file, I could pipe the filtered stream to Out-File and be done…
Peter Radocchia
  • 10,710
  • 2
  • 34
  • 56
10
votes
2 answers

When to use non-blocking >! / threads and blocking >!! / goroutines with clojure core.async

I'm writing a an ETL process to read event level data from a product database, transform / aggregate it and write to to an analytics data warehouse. I'm using clojure's core.async library to separate these process into concurrently executing…
Sean Geoffrey Pietz
  • 1,120
  • 2
  • 12
  • 17
10
votes
6 answers

Data in different resolutions

I have two tables, records are being continuously inserted to these tables from outside source. Lets say these tables are keeping statistics of user interactions. When a user is clicking a button the details of that click (the user, time of click…
nimcap
  • 10,062
  • 15
  • 61
  • 69
9
votes
3 answers

How to classify/categorize strings according to regular expression rules in Python

I am writing an ETL script in Python that gets data in CSV files, validates and sanitizes the data as well as categorizes or classifies each row according to some rules, and finally loads it into a postgresql database. The data looks like this…
ervingsb
  • 653
  • 8
  • 9
9
votes
3 answers

Open Source ETL framework

I was asked to prototype two ETL frameworks. The requirements are as follows: Open Source Available to Linux Maintained Logs can be viewed on web browser (nice to have) Written in Perl, Python, Ruby or Java The raw file can be anything (excel,…
jchips12
  • 1,177
  • 1
  • 11
  • 27
9
votes
2 answers

How do I split flat file data and load into parent-child tables in database?

I have denormalized data (coming from a file) that needs to be imported into parent-child tables. The source data is something like this: Account# Name Membership Email 101 J Burns Gold alpha@foo.com 101 J…
Brent Arias
  • 29,277
  • 40
  • 133
  • 234
9
votes
2 answers

Adding multiple partitioned columns to BigQuery table from SQL query

I've been trying to add multiple partition columns, to a BigQuery table, but it seems to only take one field, even if I add multiple partition fields in the query parameters. I'm partitioning by date time and integer range. It only takes the later…
Badruddin Kamal
  • 103
  • 1
  • 1
  • 3
9
votes
2 answers

Constraint database

I know the intuition behind constraint programming, so to say I never really experienced programming using a constraint solver. Although I think it is a different situation to be able to achieve what we would define as consistent data. Context: We…
Curcuma_
  • 851
  • 2
  • 12
  • 37
9
votes
2 answers

I have an error "java.io.FileNotFoundException: No such file or directory" while trying to create a dynamic frame using a notebook in AWS Glue

I'm setting up a new Jupyter Notebook in AWS Glue as a dev endpoint in order to test out some code for running an ETL script. So far I created a basic ETL script using AWS Glue but, for some reason, when trying to run the code on the Jupyter…
hgpestana
  • 409
  • 1
  • 4
  • 11
9
votes
0 answers

SSIS – How can I fetch custom log entries with my custom log provider?

I try to understand how the SSIS logging behavior works. I create an SSIS package that has a Data Flow Task. The Data Flow Task contains some custom data flow components. One of them overrides the RegisterLogEntries method like that: public override…
natrox
  • 91
  • 1
9
votes
8 answers

MongoDB ETL (php/java...)

Is there an ETL for MongoDB ?...
Omar
  • 101
  • 1
  • 5