Questions tagged [etl]

ETL is an acronym for Extract, Transform, and Load. It refers to a process of extracting data from source systems, transforming the data in some way (manipulating it, filtering it, combining it with other sources), and finally loading the transformed data to target system(s).

ETL is a process prominent especially in data warehousing. The term is an industry standard, representing data movement and transformation. It performs the following functions:

Extract data from homogeneous or heterogeneous data sources

Transform data - cleansing, reformatting, standardization, aggregate, join with other data, and apply business logic

Load the data in specified target tables, systems, or flat files.

The ETL processes are reusable components that can be scheduled to perform data movement on particular time intervals. Used as batch processes, ETL jobs support massive parallel processing of data. Generally they are easy to maintain and scalable.

It is used for movement across systems when a high volume of data and complex business rules are involved.

Big Data and the ELT approach (yes, ELT)

Traditional data integration was performed using batch processing (data on the rest), while big data integration can be done in real-time or with batch processing. Which make the ETL phases reordered to become ELT in some cases, so the data is extracted, loaded into distributed file systems, and then transformed before being used.

Commonly used ETL tools are, for example:

  • SQL Server Integration Services (SSIS)
  • Informatica PowerCenter
  • IBM Infosphere DataStage
  • Syncsort DMX / DMX-h
  • Oracle Data Integrator
  • Oracle Warehouse Builder
  • SAS Data Management
  • CloverETL

Open-source solutions are available as well, such as

  • Pentaho Data Integration (Kettle)
  • Talend Open Studio.

References:

5837 questions
23
votes
2 answers

Why does my ODBC connection fail when running an SSIS load in Visual Studio but not when running the same package using Execute Package Utility

I'm working on a Data Mart loading package in SSIS 2012. When attempting to execute the package in Visual Studio I get this error: "The AcquireConnection method call to the connection manager Data Warehouse.ssusr failed with error code…
Scott Wood
  • 1,077
  • 3
  • 18
  • 34
22
votes
6 answers

Can we consider AWS Glue as a replacement for EMR?

Just a quick question to clarify from Masters, since AWS Glue as an ETL tool, can provide companies with benefits such as, minimal or no server maintenance, cost savings by avoiding over-provisioning or under-provisioning resources, besides running…
Yuva
  • 2,831
  • 7
  • 36
  • 60
22
votes
1 answer

Using Pentaho Kettle, how do I load multiple tables from a single table while keeping referential integrity?

Need to load data from a single file with a 100,000+ records into multiple tables on MySQL maintaining the relationships defined in the file/tables; meaning the relationships already match. The solution should work on the latest version of MySQL,…
blunders
  • 3,619
  • 10
  • 43
  • 65
21
votes
3 answers

How do I fix 'Invalid character value for cast specification' on a date column in flat file?

I have a CSV file with a {LF} delimiting each row and a date column with the date format as "12/20/2010" (including quotation marks) My destination column is a SQL Server 2008 database table of type date (not datetime) In my Flat File Connection…
manning18
  • 957
  • 3
  • 8
  • 18
21
votes
3 answers

alter table then update in single statement

I have a requirement where I need to Alter (Add 2 columns) and then update the same table. Here is the query I tried: ALTER TABLE A ADD c1 int,c2 varchar(10) UPDATE A set c1 = 23, c2 = 'ZZXX' I need to run the above two queries at a time. I am…
Raghunath
  • 594
  • 4
  • 12
  • 27
19
votes
6 answers

How to extract data from Google Analytics and build a data warehouse (webhouse) from it?

I have click stream data such as referring URL, top landing pages, top exit pages and metrics such as page views, number of visits, bounces all in Google Analytics. There is no database yet where all this information might be stored. I am required…
nkaur301
  • 191
  • 1
  • 1
  • 5
18
votes
1 answer

does pyodbc have any design advantages over pypyodbc?

I know pyodbc is an older project and probably more featureful and robust, but is there anything about its design (based on components of compiled C code), that would make it preferable to a pure Python implementation, such as pypyodbc? I do a lot…
Tony
  • 625
  • 1
  • 6
  • 17
17
votes
3 answers

What's the most efficient way to convert a MySQL result set to a NumPy array?

I'm using MySQLdb and Python. I have some basic queries such as this: c=db.cursor() c.execute("SELECT id, rating from video") results = c.fetchall() I need "results" to be a NumPy array, and I'm looking to be economical with my memory consumption.…
thegreatt
  • 1,339
  • 2
  • 12
  • 18
17
votes
3 answers

Edit a dtsx through SSMS

I created and executed a dtsx with SSMS corresponding wizard: This was to import a flat file in an existing table. At the end I saved the "package" as a .dtsx file Now I need to modify the column mappings and re-execute this package. Is there any…
cnom
  • 3,071
  • 4
  • 30
  • 60
16
votes
7 answers

SSIS: Code page goes back to 65001

In an SSIS package that I'm writing, I have a CSV file as a source. On the Connection Manager General page, it has 65001 as the Code page (I was testing something). Unicode is not checked. The columns map to a SQL Server destination table with…
Amarundo
  • 2,357
  • 15
  • 50
  • 68
16
votes
4 answers

SSIS How to get part of a string by separator

I need an SSIS expression to get the left part of a string before the separator, and then put the new string in a new column. I checked in derived column, it seems no such expressions. Substring could only return string part with fixed length. For…
Echo
  • 1,117
  • 4
  • 22
  • 43
15
votes
1 answer

JsonPath not working properly in Spring Cloud Data Flow filter component

I'm trying to write a simple SCDF flow that reads from Kafka, filters the messages by presence of specific value and pushes data into Mongo. As part of this i had to wrote following #jsonPath …
Ján Srniček
  • 505
  • 1
  • 10
  • 34
15
votes
1 answer

Error [ERR_STREAM_PREMATURE_CLOSE]: Premature close in Node Pipeline stream

I am using the stream.pipeline functionality from Node to upload some data to S3. The basic idea I'm implementing is pulling files from a request and writing them to S3. I have one pipeline that pulls zip files and writes them to S3 successfully.…
Jason Cromer
  • 1,468
  • 3
  • 11
  • 21
15
votes
3 answers

SSIS Script Tasks losing code

I have a very strange issue happening that is causing Script Task code to clear out. I have been able to test on 2-3 different machines. We are running SSDT 15.4 preview. The steps to reproduce were as follows. Create a script task inside of a…
DataNerd
  • 349
  • 4
  • 12
15
votes
3 answers

Fill SQL database from a CSV File

I need to create a database using a CSV file with SSIS. The CSV file includes four columns: I need to use the information of that table to populate the three tables I created in SQL below. I have realized that what I need is to use one column of…
HCavill
  • 161
  • 1
  • 4