
I am writing an ETL (in Python with a MongoDB backend) and was wondering: what kind of standard functions and tools should an ETL have to be called an ETL?

This ETL will be as general purpose as possible, with a scriptable and modular approach. Mostly it will be used to keep different databases in sync, and to import/export datasets in different formats (XML and CSV). I don't need any multidimensional tools, but it is possible that they'll be needed later.

edomaur
  • What use cases do you have right now? What do you need it to do? Please list your current use cases so we can comment on those. – S.Lott Aug 24 '09 at 10:27

3 Answers


Let's think about the ETL use cases for a moment; a sketch of these stages as a composable pipeline follows the list.

  1. Extract.
    • Read databases through a generic DB-API adapter.
    • Read flat files through a similar adapter.
    • Read spreadsheets through a similar adapter.
  2. Cleanse.
    • Arbitrary rules
    • Filter and reject
    • Replace
    • Add columns of data
  3. Profile Data.
    • Statistical frequency tables.
  4. Transform (see Cleanse; they're two use cases with the same implementation)
  5. Do dimensional conformance lookups.
    • Replace values, or add values.
  6. Aggregate.
    • At any point in the pipeline
  7. Load.
    • Or prepare a flat-file and run the DB product's loader.
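Here is a minimal sketch of how those stages can compose as plain Python generators; the file names, field names, and cleansing rules are hypothetical, not part of the answer.

    import csv

    def extract(path):
        # Extract: read a flat file, yielding one dict per record.
        with open(path, newline="") as f:
            for row in csv.DictReader(f):
                yield row

    def cleanse(records):
        # Cleanse/Transform: filter and reject, replace, add columns.
        for rec in records:
            if not rec.get("customer_id"):             # filter and reject
                continue
            rec["country"] = rec["country"].upper()    # replace
            rec["amount"] = float(rec["amount"])       # add a typed column
            yield rec

    def aggregate(records):
        # Aggregate: roll up totals at this point in the pipeline.
        totals = {}
        for rec in records:
            totals[rec["country"]] = totals.get(rec["country"], 0.0) + rec["amount"]
        return totals

    def load(totals, out_path):
        # Load: write a flat file ready for the DB product's bulk loader.
        with open(out_path, "w", newline="") as f:
            writer = csv.writer(f)
            writer.writerow(["country", "total_amount"])
            writer.writerows(sorted(totals.items()))

    load(aggregate(cleanse(extract("orders.csv"))), "totals.csv")

Because each stage is a generator, records stream through one at a time; nothing is held in memory except the final aggregation.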

Further, there are some additional requirements that aren't single use cases.

  • Each individual operation has to be a separate process that can be connected into a Unix pipeline, with individual records flowing from process to process; this keeps all the CPU resources busy (see the sketch after this list).

  • You need some kind of time-based scheduler for places that have trouble reasoning out their ETL preconditions.

  • You need an event-based schedule for places that can figure out the preconditions for ETL processing steps.
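A minimal sketch of the process-per-stage point: each stage is a standalone script passing one JSON record per line over stdin/stdout, so the shell pipeline does the plumbing. The step script names and the customer_id rule are hypothetical.

    # cleanse_step.py -- one stage of the pipeline as its own process.
    # Chain stages in the shell, e.g.:
    #   python extract_step.py | python cleanse_step.py | python load_step.py
    import json
    import sys

    for line in sys.stdin:
        rec = json.loads(line)
        if not rec.get("customer_id"):   # reject records failing a rule
            continue
        rec["country"] = rec.get("country", "").upper()
        sys.stdout.write(json.dumps(rec) + "\n")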

Note: since ETL is I/O bound, multiple threads do you little good. Since each process runs for a long time -- especially if you have thousands of rows of data to process -- the overhead of "heavyweight" processes doesn't hurt.

S.Lott

Here's a random list, in no particular order (a sketch of items 5 and 6 follows the list):

  1. Connect to a wide range of sources, including all the major relational databases.
  2. Handle non-relational data sources like text files, Excel, XML, etc.
  3. Allow multiple sources to be mapped into a single target.
  4. Provide a graphical tool to help map from source to target fields.
  5. Offer a framework for injecting transformations at will.
  6. Programmable API for writing complex transformations.
  7. Optimize load process for speed.
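A minimal sketch of items 5 and 6: a tiny pipeline class whose transformations are ordinary Python callables, injected at will. The Pipeline name and the sample rules are hypothetical.

    class Pipeline:
        def __init__(self):
            self.steps = []

        def add(self, func):
            # Inject any callable taking a record and returning a record
            # (or None to reject it).
            self.steps.append(func)
            return self

        def run(self, records):
            for rec in records:
                for step in self.steps:
                    rec = step(rec)
                    if rec is None:      # a step rejected the record
                        break
                else:
                    yield rec

    # Usage: complex transformations are just functions.
    pipe = (Pipeline()
            .add(lambda r: {**r, "name": r["name"].strip()})
            .add(lambda r: r if r.get("active") else None))
    print(list(pipe.run([{"name": " Ada ", "active": True},
                         {"name": "Bob", "active": False}])))
    # -> [{'name': 'Ada', 'active': True}]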
duffymo
  • I'd like to see "graphical" removed from #4. While the pictures are cute, we always wind up reading code to see what's *really* happening. – S.Lott Aug 24 '09 at 10:18

Automatic / heuristic mapping of column names, e.g. simple string mappings (a minimal matching sketch follows below):

DB1: customerId

DB2: customer_id

I find that a lot of the work I have done in DTS/SSIS could have been automatically generated.

  • Not necessarily "required functionality", but it would keep a lot of your users very happy indeed.
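A minimal sketch of that heuristic, assuming the match only needs to ignore case and separator characters; the function names are hypothetical.

    import re

    def normalize(name):
        # customerId, customer_id, "Customer ID" all become "customerid".
        return re.sub(r"[^a-z0-9]", "", name.lower())

    def auto_map(source_cols, target_cols):
        # Pair source and target columns whose normalized names agree.
        targets = {normalize(c): c for c in target_cols}
        return {c: targets[normalize(c)]
                for c in source_cols if normalize(c) in targets}

    print(auto_map(["customerId", "orderDate"],
                   ["customer_id", "order_date", "status"]))
    # -> {'customerId': 'customer_id', 'orderDate': 'order_date'}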
geejay