
I'm moving data from an ODBC Source to an OLE DB Destination. Records are inserted into different tables on the ODBC side every day. The package gets slower and slower; it takes about a day for a million records, sometimes more. The source tables can receive both new and updated rows, and loading plus looking up the new data slows the process. Is there any way I can speed up the ETL process, or is there an open source platform I can use to load the data faster?

I tried counting the rows in the OLE DB Destination so I could insert only the records from the ODBC Source beyond that count, but to my surprise the ROW_NUMBER() function isn't supported by the OpenEdge ODBC driver.

  • Does this answer your question? [SSIS - OLE DB Destination - Table or Views load vs. Fast-load](https://stackoverflow.com/questions/14466498/ssis-ole-db-destination-table-or-views-load-vs-fast-load) – Pause Feb 06 '23 at 16:55
  • Please provide enough code so others can better understand or reproduce the problem. – Community Feb 07 '23 at 07:19

2 Answers


Based on the limited information in your question, I'd design your packages like the following

SEQC PG to SQL

The point of these operations is to transfer data from our source system verbatim to the target. The target table should be brand new and the SQL Server equivalent of the PG table from a data type perspective. Give it a clustered key if one exists; otherwise, see how a heap performs. I am going to refer to this as a staging table.
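A minimal sketch of what that staging table might look like; the real column list and data types should mirror the source table exactly, and every name here is hypothetical:

```sql
-- Hypothetical staging table; mirror the source table's columns and data types.
CREATE TABLE dbo.Stage_Customer
(
    CustomerId   int            NOT NULL,  -- natural key from the source system
    CustomerName nvarchar(100)  NULL,
    Balance      decimal(18, 2) NULL,
    UpdatedAt    datetime2(3)   NULL       -- only if the source carries a reliable timestamp
);

-- Clustered key if the source has one; otherwise start as a heap and measure.
-- CREATE CLUSTERED INDEX CIX_Stage_Customer ON dbo.Stage_Customer (CustomerId);
```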


The Data Flow itself is going to be bang simple: an ODBC Source feeding straight into an OLE DB Destination.


By default, the destination will perform a fast load and lock the table.

Run the package and observe times.
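While it runs, a rough way to watch rows arrive is a dirty-read count against the staging table (same hypothetical table as above); NOLOCK isn't accurate, but it's fine for a progress check:

```sql
-- Progress check only: a dirty read is fine here and won't block the fast load.
SELECT COUNT_BIG(*) AS RowsLoadedSoFar
FROM dbo.Stage_Customer WITH (NOLOCK);
```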

Edit the OLE DB Destination and change the Maximum insert commit size to something less than 2147483647. Try 100000: is it better or worse? Move up or down an order of magnitude until you have an idea of the fastest the package can move data.

There are a ton of variables at this stage of the game: how busy the source PG database is, what data types are involved, how far the data has to travel from the Source, to your computer, to the Destination. But this can at least help you answer "can I pull (insert large number here) rows from the source system within the expected tolerance?" If you can get the data moved from PG to SQL within the expected SLA and you still have processing time left, then move on to the next section.

Otherwise, you have to rethink your strategy for what data gets brought over. Maybe there are reliable (system-generated) insert/update times associated with the rows. Maybe it's a financial-like system where rows aren't updated, just new versions of the row are inserted and the net values are all that matter. Too many possibilities here, but you'll likely need to find a Subject Matter Expert on the system: someone who knows the logical business process the database models as well as how the data is stored in the database. Buy that person some tasty snacks because they are worth their weight in gold.
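If a reliable timestamp does exist, the ODBC Source query can be cut down to just the recent rows. A sketch against the OpenEdge SQL engine, where the last_updated column is hypothetical and the literal would normally be built from an SSIS variable holding the last successful load time:

```sql
-- Incremental pull over the OpenEdge ODBC source (PUB is the usual schema).
-- The last_updated column is an assumption; swap in whatever the system really has.
SELECT *
FROM   PUB.customer
WHERE  last_updated >= {ts '2023-02-01 00:00:00'}
```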

Now what?

At this point, we have transferred the data from PG to SQL Server and we need to figure out what to do with it. Four possibilities exist:

  • The data is brand new. We need to add the row into the target table
  • The data is unchanged. Do nothing
  • The data exists but is different. We need to change the existing row in the target table
  • There is data in the target table that isn't in the staging table. We're not going to do anything about this case either.

Adding data (inserts) is easy and can be fast; it depends on table design.

Changing data (updates) is less easy in SSIS and slower than adding new rows, in part because, behind the scenes, the database may have to delete the row and add it back in.

Non-Clustered indexes are also potential bottlenecks here, but they can also be beneficial. Welcome to the world of "it depends"

Option 1 is to just write the SQL statements to handle the insert and update. Yes, you have a lovely GUI tool for creating data flows but you need speed and this is how you get it (especially since we've already moved all the data from the external system to a central repository)
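As a rough illustration of that approach, assuming the hypothetical staging and target tables from earlier with CustomerId as the business key, the whole apply step is two set-based statements you can run from an Execute SQL Task or a stored procedure:

```sql
-- 1) Change existing rows that differ (the EXCEPT trick is a NULL-safe "anything changed?" test).
UPDATE tgt
SET    tgt.CustomerName = src.CustomerName,
       tgt.Balance      = src.Balance,
       tgt.UpdatedAt    = src.UpdatedAt
FROM   dbo.Customer       AS tgt
JOIN   dbo.Stage_Customer AS src
  ON   src.CustomerId = tgt.CustomerId
WHERE  EXISTS (SELECT src.CustomerName, src.Balance
               EXCEPT
               SELECT tgt.CustomerName, tgt.Balance);

-- 2) Add rows that only exist in staging.
INSERT INTO dbo.Customer (CustomerId, CustomerName, Balance, UpdatedAt)
SELECT src.CustomerId, src.CustomerName, src.Balance, src.UpdatedAt
FROM   dbo.Stage_Customer AS src
WHERE  NOT EXISTS (SELECT 1
                   FROM   dbo.Customer AS tgt
                   WHERE  tgt.CustomerId = src.CustomerId);
```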

Option 2 is to use a Data Flow and potentially an Execute SQL Task to move the data. The idea being, the Data Flow will segment your data into New, which will use an OLE DB Destination to write the inserts. For the updates, it depends on volume what makes the most sense from an efficiency perspective. If it's tens, hundreds, or thousands of rows to update, eh, take the performance penalty and use an OLE DB Command to update each row. Maybe it's hundreds of thousands and the package still runs well enough; then keep it.
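If you do take the OLE DB Command route for small volumes, remember it fires one statement per row; the command text is just a parameterized update along these lines (names hypothetical, each ? mapped to a data flow column):

```sql
-- Runs once per changed row coming down the data flow.
UPDATE dbo.Customer
SET    CustomerName = ?,
       Balance      = ?,
       UpdatedAt    = ?
WHERE  CustomerId   = ?;
```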

Otherwise, route your changed rows to yet another staging table and then do a mass update from the staged updates to the target table. But at this point, you just wrote half the query you needed for the first option so just write the Insert and be done (and speed up performance because now everything is just SQL Engine "stuff")
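The staged-updates variant boils down to one set-based statement in an Execute SQL Task after the Data Flow, plus clearing the delta table for the next run (names hypothetical again):

```sql
-- Apply all staged changes in one pass, then reset the delta table.
UPDATE tgt
SET    tgt.CustomerName = upd.CustomerName,
       tgt.Balance      = upd.Balance,
       tgt.UpdatedAt    = upd.UpdatedAt
FROM   dbo.Customer               AS tgt
JOIN   dbo.Stage_Customer_Updates AS upd
  ON   upd.CustomerId = tgt.CustomerId;

TRUNCATE TABLE dbo.Stage_Customer_Updates;
```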

billinkc

You might want to investigate Progress' Change Data Capture feature. If you have a modern release of OpenEdge (11.7 or better) and the proper licenses you can enable CDC policies to track changes. Your ETL process could then use that information to target its efforts.
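As a very rough sketch of how the ETL could consume the CDC output over the SQL engine: the change table name is whatever your policy specifies ("CDC_Customer" below is made up), and the control column used as a high-water mark is an assumption to verify against the CDC documentation for your OpenEdge release:

```sql
-- Hypothetical change table created by a CDC policy; column names need to be
-- checked against the OpenEdge CDC docs for your release.
SELECT *
FROM   PUB.CDC_Customer
WHERE  "_Change-Sequence" > 123456   -- high-water mark saved from the previous ETL run
ORDER  BY "_Change-Sequence";
```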

Warning: it's complicated. There is a lot more to actually doing it than marketing would have you believe. But if your use case is straightforward it might not be too terrible.

Or you could implement the Progress "Pro2" product to do all the dirty work for you. (That's an extra-cost option.)

Tom Bascom