
I'm trying to improve the performance of loading a dimensional data warehouse. Each run of the loading operation loads tens of thousands of records; sometimes it loads hundreds of thousands; and semi-annually it loads millions.

For the loading operation, I build an in-memory record corresponding to a 48-column Fact table record which is to be updated or inserted (depending on whether a previous version of the record is already in the Fact table). The in-memory record includes about 2 dozen foreign key pointers to various dimension tables. I have indexes on all of those Fact table foreign keys (and of course I can disable those indexes while loading the data warehouse).

After populating the in-memory record with fresh data for the Fact table record, I add it to the Fact table (update or insert, as above).

I am wondering if I could improve performance in update situations by updating only those columns which have changed, instead of mindlessly updating every column in the Fact table record. Doing this would make my loading program more complex, so would I actually gain any benefit? Suppose, for example, that one of those dimension table foreign keys hasn't changed. Is there a performance benefit if I do not update that particular column? I guess it comes down to whether SQL Server internally tracks (counts?) foreign key references between the Fact and dimension tables. If SQL Server does such tracking, then skipping the unchanged column might save that internal bookkeeping and so perform better. But if it does not, then I suppose I should just update all 48 Fact table columns regardless of whether they have changed, since that avoids adding complexity and overhead to my loading program.
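To make the idea concrete, here is a cut-down sketch with only three of the 48 columns and hypothetical table, column, and parameter names. Since skipping individual columns would require dynamic SQL, the "only what changed" variant is expressed at the row level: skip the write entirely when nothing differs.

```sql
-- Hypothetical names; only 3 of the 48 columns shown.

-- What I do today: write every column, changed or not.
UPDATE f
SET    f.CustomerKey = @CustomerKey,
       f.ProductKey  = @ProductKey,
       f.SalesAmount = @SalesAmount
FROM   dbo.FactSales AS f
WHERE  f.SalesId = @SalesId;

-- The alternative I'm considering, at the row level:
-- only touch the row if at least one value actually differs.
-- (SELECT ... EXCEPT SELECT ... treats two NULLs as equal, unlike <>.)
UPDATE f
SET    f.CustomerKey = @CustomerKey,
       f.ProductKey  = @ProductKey,
       f.SalesAmount = @SalesAmount
FROM   dbo.FactSales AS f
WHERE  f.SalesId = @SalesId
  AND  EXISTS (SELECT @CustomerKey, @ProductKey, @SalesAmount
               EXCEPT
               SELECT f.CustomerKey, f.ProductKey, f.SalesAmount);
```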

I'd appreciate any comments or suggestions.

BRW
  • There's not really a question that can be answered here. But it doesn't sound like you have enough data to make any optimizations worth the effort and increased complexity. – David Browne - Microsoft Oct 06 '19 at 23:40
  • What do you mean by in-memory record? What special sauce do you use to make it in-memory? What language is your loader built in? You mention "loading program" but is that a stored procedure or a C# app? – Nick.Mc Oct 07 '19 at 02:06
  • Nick: It's a VB6 program. By "in-memory record" I mean a [user-defined type.](https://learn.microsoft.com/en-us/office/vba/language/reference/user-interface-help/type-statement) – BRW Oct 07 '19 at 03:17
  • If you are doing RBAR operations in a VB6 app to load data, then you can improve performance by 1. importing all records into a staging table in the database in a single bulk insert operation, and 2. executing a couple of commands (INSERT, UPDATE) to process the data in one go (see the sketch after these comments). Running row-by-row data loads in any kind of external code is one of the slowest ways possible of loading data. – Nick.Mc Oct 09 '19 at 13:30
  • Nick: Thanks for your additional comment. As I'm sure you (and others) have guessed, I'm learning about RBAR operations the hard way! I'm going to try your suggested bulk insert approach, once I figure out how to do that with XML data. Specifically, my input data consists of tens of thousands (or more) of complex XML files. So, my VB6 loading program is an XML shredder. If anyone can point me to any helpful links pertaining to bulk insert of XML data for shredding purposes, I'd appreciate it. – BRW Oct 09 '19 at 17:35
  • OK that's a _little_ different. XML is a painful and redundant format for bulk data integration but I guess you're working that out. Is the XML heavily nested? Is it large? You can shred it in your program and stage tabular data or you can insert the XML into the database and shred it in there. An external program is probably quicker for shredding but bulk updates/inserts are always faster than individual inserts to the database – Nick.Mc Oct 09 '19 at 22:19
  • Nick: Yes, it's heavily nested. Many hundreds (sometimes thousands) of elements per XML file, of which I'm interested in only about 4 or 5 dozen. Currently, I'm shredding into the in-memory record mentioned in my OP, then updating the Fact table with that via a SQL Server sproc. It seems to me that whether I shred into a staging database or shred directly into the data warehouse, I'll face the same speed bottleneck. So, I am going to read up on bulk updates/inserts, as well as the notion of inserting the XML into the DW and shredding it in there; I've never tried those techniques (there are sketches of both below these comments). – BRW Oct 10 '19 at 03:43
  • By nested I mean, does it have elements inside elements inside elements? Or is just a big flat list of elements? (which might have some level of nesting around it) – Nick.Mc Oct 10 '19 at 10:40
  • Yes, it's nested. I recursively examine every element in every file (XPath), checking to see if each element is one I'm interested in. This brute force approach works, but it’s ridiculously inefficient. Also, the XML files don’t contain the same sets of elements in every file. Some elements exist in some files but not in others. None of the XML files contain every possible element. There are many different possible combinations of elements which may be included or excluded from a particular XML file, depending on the specific circumstances to which that file pertains. – BRW Oct 10 '19 at 15:43
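For reference, here is a minimal sketch of the staging-table approach Nick.Mc describes above, using hypothetical table and column names (a single MERGE statement would be an alternative to the separate UPDATE and INSERT).

```sql
-- Hypothetical names. Step 1: bulk-load all shredded rows into a staging table
-- (e.g. via BULK INSERT, bcp, or SqlBulkCopy), then process them in one pass.

-- Step 2a: update existing fact rows, skipping rows that haven't changed.
UPDATE f
SET    f.CustomerKey = s.CustomerKey,
       f.ProductKey  = s.ProductKey,
       f.SalesAmount = s.SalesAmount
FROM   dbo.FactSales    AS f
JOIN   dbo.StagingSales AS s
       ON s.SalesId = f.SalesId
WHERE  EXISTS (SELECT s.CustomerKey, s.ProductKey, s.SalesAmount
               EXCEPT
               SELECT f.CustomerKey, f.ProductKey, f.SalesAmount);

-- Step 2b: insert rows that are not yet in the fact table.
INSERT INTO dbo.FactSales (SalesId, CustomerKey, ProductKey, SalesAmount)
SELECT s.SalesId, s.CustomerKey, s.ProductKey, s.SalesAmount
FROM   dbo.StagingSales AS s
WHERE  NOT EXISTS (SELECT 1
                   FROM dbo.FactSales AS f
                   WHERE f.SalesId = s.SalesId);
```

The two statements each scan the staging table once, so the per-row round trips from the VB6 loader disappear; how the staging table gets populated (file bulk load vs. in-database XML shredding) is a separate choice.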
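And here is a minimal sketch of shredding XML inside SQL Server with nodes()/value(), again with a made-up element layout; in practice each file would first be loaded into an xml variable or column (e.g. via OPENROWSET ... SINGLE_BLOB).

```sql
-- Hypothetical XML shape: one <Sale> element per fact row, values in child elements.
DECLARE @x xml = N'
<Sales>
  <Sale><SalesId>1</SalesId><CustomerKey>17</CustomerKey><ProductKey>5</ProductKey><SalesAmount>42.50</SalesAmount></Sale>
  <Sale><SalesId>2</SalesId><CustomerKey>99</CustomerKey><ProductKey>8</ProductKey><SalesAmount>10.00</SalesAmount></Sale>
</Sales>';

-- Shred all <Sale> elements into the (hypothetical) staging table in one statement.
INSERT INTO dbo.StagingSales (SalesId, CustomerKey, ProductKey, SalesAmount)
SELECT n.value('(SalesId/text())[1]',     'int'),
       n.value('(CustomerKey/text())[1]', 'int'),
       n.value('(ProductKey/text())[1]',  'int'),
       n.value('(SalesAmount/text())[1]', 'decimal(18,2)')
FROM   @x.nodes('/Sales/Sale') AS t(n);
```

On real files this keeps the shredding set-based rather than element-by-element XPath calls from VB6, though a heavily nested document may need several nodes() passes (one per repeating level).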

0 Answers