
I am writing an ETL (in Python with a MongoDB backend) and was wondering: what kind of standard functions and tools should an ETL have to be called an ETL?

This ETL will be as general purpose as possible, with a scriptable and modular approach. Mostly it will be used to keep different databases in sync, and to import/export datasets in different formats (XML and CSV). I don't need any multidimensional tools, but it is possible that they'll be needed later.

edomaur
  • What use cases do you have right now? What do you need it to do? Please list your current use cases so we can comment on those. – S.Lott Aug 24 '09 at 10:27

3 Answers


Let's think about the ETL use cases for a moment; a sketch of these stages as a composable pipeline follows the list.

  1. Extract.
    • Read databases through a generic DB-API adapter.
    • Read flat files through a similar adapter.
    • Read spreadsheets through a similar adapter.
  2. Cleanse.
    • Arbitrary rules
    • Filter and reject
    • Replace
    • Add columns of data
  3. Profile Data.
    • Statistical frequency tables.
  4. Transform (see Cleanse; they're two use cases with the same implementation)
  5. Do dimensional conformance lookups.
    • Replace values, or add values.
  6. Aggregate.
    • At any point in the pipeline
  7. Load.
    • Or prepare a flat-file and run the DB product's loader.
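Here is a minimal sketch of how those stages can compose as plain Python generators; the file names, field names, and cleansing rules are hypothetical, not part of the answer.

    import csv

    def extract(path):
        # Extract: read a flat file, yielding one dict per record.
        with open(path, newline="") as f:
            for row in csv.DictReader(f):
                yield row

    def cleanse(records):
        # Cleanse/Transform: filter and reject, replace, add columns.
        for rec in records:
            if not rec.get("customer_id"):             # filter and reject
                continue
            rec["country"] = rec["country"].upper()    # replace
            rec["amount"] = float(rec["amount"])       # add a typed column
            yield rec

    def aggregate(records):
        # Aggregate: roll up totals at this point in the pipeline.
        totals = {}
        for rec in records:
            totals[rec["country"]] = totals.get(rec["country"], 0.0) + rec["amount"]
        return totals

    def load(totals, out_path):
        # Load: write a flat file ready for the DB product's bulk loader.
        with open(out_path, "w", newline="") as f:
            writer = csv.writer(f)
            writer.writerow(["country", "total_amount"])
            writer.writerows(sorted(totals.items()))

    load(aggregate(cleanse(extract("orders.csv"))), "totals.csv")

Because each stage is a generator, records stream through one at a time; nothing is held in memory except the final aggregation.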

Further, there are some additional requirements that aren't single use cases.

  • Each individual operation has to be a separate process that can be connected into a Unix pipeline, with individual records flowing from process to process; this keeps all the CPU resources busy (see the sketch after this list).

  • You need some kind of time-based scheduler for places that have trouble reasoning out their ETL preconditions.

  • You need an event-based schedule for places that can figure out the preconditions for ETL processing steps.
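A minimal sketch of the process-per-stage point: each stage is a standalone script passing one JSON record per line over stdin/stdout, so the shell pipeline does the plumbing. The step script names and the customer_id rule are hypothetical.

    # cleanse_step.py -- one stage of the pipeline as its own process.
    # Chain stages in the shell, e.g.:
    #   python extract_step.py | python cleanse_step.py | python load_step.py
    import json
    import sys

    for line in sys.stdin:
        rec = json.loads(line)
        if not rec.get("customer_id"):   # reject records failing a rule
            continue
        rec["country"] = rec.get("country", "").upper()
        sys.stdout.write(json.dumps(rec) + "\n")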

Note: since ETL is I/O bound, multiple threads do you little good. Since each process runs for a long time -- especially if you have thousands of rows of data to process -- the overhead of "heavyweight" processes doesn't hurt.

S.Lott

Here's a random list, in no particular order (a sketch of items 5 and 6 follows the list):

  1. Connect to a wide range of sources, including all the major relational databases.
  2. Handle non-relational data sources like text files, Excel, XML, etc.
  3. Allow multiple sources to be mapped into a single target.
  4. Provide a graphical tool to help map from source to target fields.
  5. Offer a framework for injecting transformations at will.
  6. Programmable API for writing complex transformations.
  7. Optimize load process for speed.
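A minimal sketch of items 5 and 6: a tiny pipeline class whose transformations are ordinary Python callables, injected at will. The Pipeline name and the sample rules are hypothetical.

    class Pipeline:
        def __init__(self):
            self.steps = []

        def add(self, func):
            # Inject any callable taking a record and returning a record
            # (or None to reject it).
            self.steps.append(func)
            return self

        def run(self, records):
            for rec in records:
                for step in self.steps:
                    rec = step(rec)
                    if rec is None:      # a step rejected the record
                        break
                else:
                    yield rec

    # Usage: complex transformations are just functions.
    pipe = (Pipeline()
            .add(lambda r: {**r, "name": r["name"].strip()})
            .add(lambda r: r if r.get("active") else None))
    print(list(pipe.run([{"name": " Ada ", "active": True},
                         {"name": "Bob", "active": False}])))
    # -> [{'name': 'Ada', 'active': True}]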
duffymo
  • I'd like to see "graphical" removed from #4. While the pictures are cute, we always wind up reading code to see what's *really* happening. – S.Lott Aug 24 '09 at 10:18

Automatic / heuristic mapping of column names, e.g. simple string mappings (a minimal matching sketch follows below):

DB1: customerId

DB2: customer_id

I find that a lot of the work I have done in DTS/SSIS could have been automatically generated.

  • Not necessarily "required functionality", but it would keep a lot of your users very happy indeed.
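A minimal sketch of that heuristic, assuming the match only needs to ignore case and separator characters; the function names are hypothetical.

    import re

    def normalize(name):
        # customerId, customer_id, "Customer ID" all become "customerid".
        return re.sub(r"[^a-z0-9]", "", name.lower())

    def auto_map(source_cols, target_cols):
        # Pair source and target columns whose normalized names agree.
        targets = {normalize(c): c for c in target_cols}
        return {c: targets[normalize(c)]
                for c in source_cols if normalize(c) in targets}

    print(auto_map(["customerId", "orderDate"],
                   ["customer_id", "order_date", "status"]))
    # -> {'customerId': 'customer_id', 'orderDate': 'order_date'}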
geejay