
I have a working MySQL data warehouse that is organized as a star schema and I am using Talend Open Studio for Data Integration 5.1 to create the ETL process. I would like this process to run once per day. I have estimated that one of the dimension tables (dimUser) will have approximately 2 million records and 23 columns.

I created a small test ETL process in Talend that worked, but given the amount of data that may need to be updated daily, the current performance will not cut it. The ETL process takes four minutes to UPDATE or INSERT 1,000 records in dimUser. Assuming a linear relationship between the number of records and the time to UPDATE or INSERT, 2 million records would take roughly 8,000 minutes (more than five days), so there is no way the ETL can finish in the 3-4 hours I was hoping for, let alone within a day.

Since I'm unfamiliar with Java, I also wrote the ETL as a Python script and ran into the same problem. However, I did discover that if I did only INSERTs, the process went much faster. I am pretty sure that the bottleneck is caused by the UPDATE statements.

The primary key in dimUser is an auto-increment integer. My friend suggested that I scrap this primary key and replace it with a multi-field primary key (in my case, 2-3 fields).

Before I rip the test data out of my warehouse and change the schema, can anyone provide suggestions or guidelines related to

  1. the design of the data warehouse
  2. the ETL process
  3. how realistic it is to have an ETL process INSERT or UPDATE a few million records each day
  4. whether my friend's suggestion will significantly help

If you need any further information, just let me know and I'll post it.

UPDATE - additional information:

mysql> describe dimUser;
Field                        Type                Null Key Default            Extra
user_key                     int(10) unsigned    NO   PRI NULL               auto_increment
id_A                         int(10) unsigned    NO       NULL
id_B                         int(10) unsigned    NO       NULL
field_4                      tinyint(4) unsigned NO       0
field_5                      varchar(50)         YES      NULL
city                         varchar(50)         YES      NULL
state                        varchar(2)          YES      NULL
country                      varchar(50)         YES      NULL
zip_code                     varchar(10)         NO       99999
field_10                     tinyint(1)          NO       0
field_11                     tinyint(1)          NO       0
field_12                     tinyint(1)          NO       0
field_13                     tinyint(1)          NO       1
field_14                     tinyint(1)          NO       0
field_15                     tinyint(1)          NO       0
field_16                     tinyint(1)          NO       0
field_17                     tinyint(1)          NO       1
field_18                     tinyint(1)          NO       0
field_19                     tinyint(1)          NO       0
field_20                     tinyint(1)          NO       0
create_date                  datetime            NO       2012-01-01 00:00:00
last_update                  datetime            NO       2012-01-01 00:00:00
run_id                       int(10) unsigned    NO       999

I used a surrogate key because I had read that it was good practice. From a business perspective, I want to stay aware of potential fraudulent activity (say a user is associated with state X for 200 days and then the next day is associated with state Y - they could have moved, or their account could have been compromised), which is why the geographic data is kept. The field id_B may have a few distinct values of id_A associated with it, but I am interested in distinct (id_A, id_B) tuples. In the context of this information, my friend suggested that something like (id_A, id_B, zip_code) be the primary key.

For the large majority of daily ETL processes (>80%), I only expect the following fields to be updated for existing records: field_10 - field_14, last_update, and run_id (this field is a foreign key to my etlLog table and is used for ETL auditing purposes).

Jubbles
  • If you change your primary key to something that may change during the update, then the index will need to be recalculated every time it's updated, which will be slower. It seems very odd that it would take 4 minutes to update only 1,000 rows on a medium-sized table; perhaps there is a problem with your script rather than the schema. Have you tried benchmarking the updates directly in SQL vs. from your script? – limscoder Jul 02 '12 at 03:12
  • How did you come up with a few million every day? – Damir Sudarevic Jul 02 '12 at 13:42
  • Your key is fine; are you using a key pipeline when loading? What is your business key on dimUser? – Damir Sudarevic Jul 02 '12 at 13:43
  • @Damir: The two million value comes from historically observed data. What is a key pipeline? – Jubbles Jul 02 '12 at 13:44
  • Ok, first -- ETL should pick up only changed data from your operational system. A key pipeline is a table (built once a day) that looks like (UserBusinessKey, PrimaryKey) for existing users; see the sketch just after these comments. This way it is fast to split existing records from new records into different streams. – Damir Sudarevic Jul 02 '12 at 13:50
  • Do read Kimball's book. It is all plain vanilla. – Damir Sudarevic Jul 02 '12 at 13:51
  • Focus on "Loading Type-2 Dimensions". – Damir Sudarevic Jul 02 '12 at 15:21
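
A minimal sketch of the key pipeline Damir describes, assuming (id_A, id_B) is the business key for dimUser (the table name keyPipelineUser is just illustrative):

-- Rebuilt once per load: maps each existing business key to its surrogate key,
-- so incoming records can be split quickly into an "existing" stream (UPDATE)
-- and a "new" stream (INSERT).
CREATE TABLE keyPipelineUser (
    id_A     INT UNSIGNED NOT NULL,
    id_B     INT UNSIGNED NOT NULL,
    user_key INT UNSIGNED NOT NULL,
    PRIMARY KEY (id_A, id_B)
);

INSERT INTO keyPipelineUser (id_A, id_B, user_key)
SELECT id_A, id_B, user_key
FROM dimUser;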

1 Answer


Here are my thoughts on your questions.

1) Warehouse Design:

Read Ralph Kimball's book, The Data Warehouse Toolkit.

Your dimension table has a bunch of columns with meaningless names. Instead of field_5, the column should be given a name that has a business meaning. Data Warehousing is for ease of querying by business & reporting folks.

I don't see any fact tables here. Understanding what the user dimension will be used for is important to its design.

2) The ETL Process

Have you identified where the bottleneck in the ETL process is? Is it in reading the data from the source, transforming that data, or writing to the database? You may be writing at 40,000 rows/sec, but if you can only read 1,000 rows/sec from an XML data source, you're not going to get very far.

Have you considered loading the changed records to a stage table in the database first, without any transformation, then using SQL to transform and update the data? Often you'll find performance in the database is better than offloading the work to an ETL Tool.
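
For illustration, here is a rough MySQL sketch of that pattern, assuming a staging table named stageUser that holds the day's changed records and that (id_A, id_B) uniquely identifies a user (both are assumptions about your setup, not a prescription):

-- 1. Bulk-load the changed records into stageUser (e.g. LOAD DATA INFILE or the
--    ETL tool's bulk output), with no per-row logic.

-- 2. Apply changes to existing records as one set-based statement
--    (only the fields you said typically change):
UPDATE dimUser d
JOIN stageUser s
  ON s.id_A = d.id_A AND s.id_B = d.id_B
SET d.field_10    = s.field_10,
    d.field_11    = s.field_11,
    d.field_12    = s.field_12,
    d.field_13    = s.field_13,
    d.field_14    = s.field_14,
    d.last_update = NOW(),
    d.run_id      = s.run_id;

-- 3. Insert records that don't exist yet (anti-join on the business key):
INSERT INTO dimUser (id_A, id_B, field_4, field_5, city, state, country, zip_code,
                     field_10, field_11, field_12, field_13, field_14, field_15,
                     field_16, field_17, field_18, field_19, field_20,
                     create_date, last_update, run_id)
SELECT s.id_A, s.id_B, s.field_4, s.field_5, s.city, s.state, s.country, s.zip_code,
       s.field_10, s.field_11, s.field_12, s.field_13, s.field_14, s.field_15,
       s.field_16, s.field_17, s.field_18, s.field_19, s.field_20,
       NOW(), NOW(), s.run_id
FROM stageUser s
LEFT JOIN dimUser d
  ON d.id_A = s.id_A AND d.id_B = s.id_B
WHERE d.user_key IS NULL;

An index (or unique key) on dimUser(id_A, id_B) is what makes both the join and the anti-join cheap.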

3) It's very realistic to update a few million records daily, if the hardware can handle it. I think it's important to understand whether you are going for a Type 1 dimension, where you simply overwrite changes (in which case deleting the changed rows and then inserting them may be a better option than an update-else-insert).
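
A minimal sketch of that delete-then-insert variant, again assuming a stageUser staging table; here it holds only the changed existing rows, mirrors dimUser's layout, and already carries the surrogate key (looked up beforehand, e.g. via a key pipeline), so fact-table references stay valid:

-- Staging table mirrors the dimension, with user_key resolved for existing users:
-- CREATE TABLE stageUser LIKE dimUser;

DELETE d
FROM dimUser d
JOIN stageUser s ON s.user_key = d.user_key;

-- Reload the changed rows in one pass (brand-new users go through a separate
-- plain INSERT so the auto_increment can assign their user_key):
INSERT INTO dimUser
SELECT * FROM stageUser;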

If you are keeping history of changes in a Type 2 dimension, you might want to consider snowflaking the fields you want to track changes on into a separate mini-dimension. Kimball discusses this technique when you have a very large "customer" dimension. You would then use a periodic snapshot fact table, which would allow you to track the changes to the users over time.
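
Very roughly, and purely to illustrate the shape (all names here are hypothetical), the mini-dimension holds the combinations of the frequently-changing flags, and a periodic snapshot fact records which combination each user had on each day:

-- Mini-dimension: one row per combination of the rapidly-changing flags
-- (if these really are yes/no flags, that is at most 32 rows, ever).
CREATE TABLE dimUserProfile (
    user_profile_key INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    field_10 TINYINT(1) NOT NULL,
    field_11 TINYINT(1) NOT NULL,
    field_12 TINYINT(1) NOT NULL,
    field_13 TINYINT(1) NOT NULL,
    field_14 TINYINT(1) NOT NULL,
    UNIQUE KEY uq_profile (field_10, field_11, field_12, field_13, field_14)
);

-- Periodic snapshot: one row per user per day pointing at the profile that was
-- current on that date, so dimUser itself no longer needs millions of daily UPDATEs.
CREATE TABLE factUserSnapshot (
    date_key         INT UNSIGNED NOT NULL,
    user_key         INT UNSIGNED NOT NULL,
    user_profile_key INT UNSIGNED NOT NULL,
    PRIMARY KEY (date_key, user_key)
);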

4) Your friend's suggestion to create a primary key out of the natural business keys is not a good idea for a data warehouse environment. We create an integer surrogate key so we can include it in the fact tables and keep them skinny, since they will be orders of magnitude larger than the dimension tables.
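
To make the "skinny" argument concrete (the fact table and measure below are hypothetical): with a surrogate key, every fact row carries a single 4-byte user_key, whereas an (id_A, id_B, zip_code) key would repeat two integers plus a varchar(10) in every fact row and force three-column joins.

-- Hypothetical fact table: the user reference is one 4-byte integer.
CREATE TABLE factUserActivity (
    date_key     INT UNSIGNED NOT NULL,
    user_key     INT UNSIGNED NOT NULL,   -- joins to dimUser.user_key
    some_measure DECIMAL(12,2) NOT NULL,
    KEY idx_user (user_key)
);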

N West
  • @N West: Thank you for your answer, although I think you took the information from my post a little too literally. Some of the fields listed above have generic names because I am working within a business environment and do not wish to post information that may identify which employer I work for. Also, while I did solicit answers related to design of the data warehouse, I never thought I would receive a response about field naming practices. Also (again), I do have fact tables, but didn't think it was necessary to post the entire schema of the data warehouse. – Jubbles Jul 05 '12 at 22:33
  • @Jubbles - Understood :) - I just see too often that the "AttributeX" columns from something like an Oracle ERP are duplicated in a data warehouse, instead of being given solid business names. As to the schema - how a dimension is designed can often be affected by how it will be used in the facts. If you're going for detailed user analytics, it may make sense to split the user dimension into multiple dims to allow for faster slicing/dicing on the fact table. – N West Jul 06 '12 at 01:20