Questions tagged [etl]

ETL is an acronym for Extract, Transform, and Load. It refers to a process of extracting data from source systems, transforming the data in some way (manipulating it, filtering it, combining it with other sources), and finally loading the transformed data to target system(s).

ETL is a process prominent especially in data warehousing. The term is an industry standard, representing data movement and transformation. It performs the following functions:

Extract data from homogeneous or heterogeneous data sources

Transform data - cleansing, reformatting, standardization, aggregate, join with other data, and apply business logic

Load the data in specified target tables, systems, or flat files.

The ETL processes are reusable components that can be scheduled to perform data movement on particular time intervals. Used as batch processes, ETL jobs support massive parallel processing of data. Generally they are easy to maintain and scalable.

It is used for movement across systems when a high volume of data and complex business rules are involved.

Big Data and the ELT approach (yes, ELT)

Traditional data integration was performed using batch processing (data on the rest), while big data integration can be done in real-time or with batch processing. Which make the ETL phases reordered to become ELT in some cases, so the data is extracted, loaded into distributed file systems, and then transformed before being used.

Commonly used ETL tools are, for example:

  • SQL Server Integration Services (SSIS)
  • Informatica PowerCenter
  • IBM Infosphere DataStage
  • Syncsort DMX / DMX-h
  • Oracle Data Integrator
  • Oracle Warehouse Builder
  • SAS Data Management
  • CloverETL

Open-source solutions are available as well, such as

  • Pentaho Data Integration (Kettle)
  • Talend Open Studio.

References:

5837 questions
31
votes
3 answers

Writing JSON column to Postgres using Pandas .to_sql

During an ETL process I needed to extract and load a JSON column from one Postgres database to another. We use Pandas for this since it has so many ways to read and write data from different sources/destinations and all the transformations can be…
peralmq
  • 2,160
  • 1
  • 22
  • 20
31
votes
1 answer

Load CSV into Redshift, with header?

Is there an option to load a CSV into Redshift with a header? I see the documentation for CSV but it says nothing about a header. Ideally it could use the header to determine the columns to load.
Some Guy
  • 12,768
  • 22
  • 58
  • 86
30
votes
8 answers

Get Last non empty column and row index from excel using Interop

I am trying to remove all extra blank rows and columns from an excel file using Interop Library. I followed this question Fastest method to remove Empty rows and Columns From Excel Files using Interop and i find it helpful. But i have excel files…
Yahfoufi
  • 2,220
  • 1
  • 22
  • 41
30
votes
1 answer

SSIS - the value cannot be converted because of a potential loss of data

I am relatively new to SSIS. I am trying to extract information from an Oracle database using Microsoft OLEDB for Oracle and I am using this query: SELECT ID FROM Test I get an error message saying: the value cannot be converted because of a…
w0051977
  • 15,099
  • 32
  • 152
  • 329
29
votes
1 answer

How do I unit test and integration test my SSIS packages?

This question mentions two libraries, both of which aren't maintained and one has broken links to the source and documentation. SSISUnit was last updated in 2008 and SSIStester has broken links in the documentation and hasn't been updated since…
mattrowsboats
  • 582
  • 1
  • 6
  • 14
28
votes
10 answers

Unit testing large blocks of code (mappings, translation, etc)

We unit test most of our business logic, but are stuck on how best to test some of our large service tasks and import/export routines. For example, consider the export of payroll data from one system to a 3rd party system. To export the data in…
Andrew
  • 1,765
  • 2
  • 17
  • 24
28
votes
3 answers

DAG(directed acyclic graph) dynamic job scheduler

I need to manage a large workflow of ETL tasks, which execution depends on time, data availability or an external event. Some jobs may fail during execution of the workflow and the system should have the ability to restart a failed workflow branch…
26
votes
3 answers

Adding logs to Airflow Logs

How can I add my own logs onto the Apache Airflow logs that are automatically generated? any print statements wont get logged in there, so I was wondering how I can add my logs so that it shows up on the UI as well?
user3542930
  • 537
  • 2
  • 6
  • 12
25
votes
2 answers

How to add a new Struct column to a DataFrame

I'm currently trying to extract a database from MongoDB and use Spark to ingest into ElasticSearch with geo_points. The Mongo database has latitude and longitude values, but ElasticSearch requires them to be casted into the geo_point type. Is there…
Kim Ngo
  • 471
  • 2
  • 7
  • 14
24
votes
1 answer

ETL in Java Spring Batch vs Apache Spark Benchmarking

I have been working with Apache Spark + Scala for over 5 years now (Academic and Professional experiences). I always found Spark/Scala to be one of the robust combos for building any kind of Batch or Streaming ETL/ ELT applications. But lately, my…
underwood
  • 845
  • 2
  • 11
  • 22
24
votes
3 answers

SSIS - How to access a RecordSet variable inside a Script Task

How do you access a RecordSet variable inside a Script Task?
GordyII
  • 7,067
  • 16
  • 51
  • 69
24
votes
3 answers

Dynamically assign filename to excel connection string

This is my very first time playing with SSIS in SQL Server 2012. I can successfully read an excel file and load its content to a table in SQL server 2012. The task is a simple direct read excel file then copy to sql server with no validation or…
user1205746
  • 3,110
  • 11
  • 44
  • 73
23
votes
1 answer

Missing SSIS toolbox in Visual Studio

I am running SQL Server 2014 on my machine. I have installed SSDT for Visual Studio 2017, which I also have. Now I am not able to find the SSIS Toolbox in my Visual Studio. Do I need to re-install the SSDT all over again? Please let me know. Thanks…
Shriram Rao
  • 251
  • 1
  • 2
  • 6
23
votes
4 answers

Java ETL: hard to find a suitable one

I'm looking for an embeddable Java ETL, i.e., an Extract Transform Load engine that can be called from Java code. I'm finding it surprisingly hard to find a suitable one. I'm mainly looking at loading delimited text files into database tables,…
tpdi
  • 34,554
  • 11
  • 80
  • 120
23
votes
4 answers

Is it possible to use relative paths for SSIS packages dtsConfig files?

I am trying to make our SQL Server Integration Services packages as portable as possible and the one thing that is preventing that is that the path to the config is always an absolute path, which makes testing and deployment a headache. Are there…
John Lemp
  • 5,029
  • 3
  • 28
  • 36