
I need to regularly import large (hundreds of thousands of lines) tsv files into multiple related SQL Server 2008 R2 tables.

The input file looks something like this (it's actually even more complex and the data is of a different nature, but what I have here is analogous):

January_1_Lunch.tsv
+-------+----------+-------------+---------+
| Diner | Beverage | Food        | Dessert |
+-------+----------+-------------+---------+
| Nancy | coffee   | salad_steak | pie     |
| Joe   | milk     | soup_steak  | cake    |
| Pat   | coffee   | soup_tofu   | pie     |
+-------+----------+-------------+---------+

Notice that one column contains a character-delimited list that needs preprocessing to split it up.

The schema is highly normalized -- each record has multiple many-to-many foreign key relationships. Nothing too unusual here...

Meals
+----+-----------------+
| id |       name      |
+----+-----------------+
|  1 | January_1_Lunch |
+----+-----------------+

Beverages
+----+--------+
| id |  name  |
+----+--------+
|  1 | coffee |
|  2 | milk   |
+----+--------+

Food
+----+-------+
| id | name  |
+----+-------+
|  1 | salad |
|  2 | soup  |
|  3 | steak |
|  4 | tofu  |
+----+-------+

Desserts
+----+------+
| id | name |
+----+------+
|  1 | pie  |
|  2 | cake |
+----+------+

Each input column is ultimately destined for a separate table.

This might seem an unnecessarily complex schema -- why not just have a single table that matches the input? But consider that a diner may come into the restaurant and order only a drink or a dessert, in which case there would be many null rows. Considering that this DB will ultimately store hundreds of millions of records, that seems like a poor use of storage. I also want to be able to generate reports for just beverages, just desserts, etc., and I figure those will perform much better with separate tables.

The orders are tracked in relationship tables like this:

BeverageOrders
+--------+---------+------------+
| mealId | dinerId | beverageId |
+--------+---------+------------+
|      1 |       1 |          1 |
|      1 |       2 |          2 |
|      1 |       3 |          1 |
+--------+---------+------------+

FoodOrders
+--------+---------+--------+
| mealId | dinerId | foodId |
+--------+---------+--------+
|      1 |       1 |      1 |
|      1 |       1 |      3 |
|      1 |       2 |      2 |
|      1 |       2 |      3 |
|      1 |       3 |      2 |
|      1 |       3 |      4 |
+--------+---------+--------+

DessertOrders
+--------+---------+-----------+
| mealId | dinerId | dessertId |
+--------+---------+-----------+
|      1 |       1 |         1 |
|      1 |       2 |         2 |
|      1 |       3 |         1 |
+--------+---------+-----------+

Note that there are more records for Food because the input contained those nasty little lists that were split into multiple records. This is another reason it helps to have separate tables.


So the question is, what's the most efficient way to get the data from the file into the schema you see above?

Approaches I've considered:

  1. Parse the tsv file line-by-line, performing the inserts as I go. Whether using an ORM or not, this seems like a lot of trips to the database and would be very slow.
  2. Parse the tsv file to data structures in memory, or multiple files on disk, that correspond to the schema. Then use SqlBulkCopy to import each one. While it's fewer transactions, it seems more expensive than simply performing lots of inserts, due to having to either cache a lot of data or perform many writes to disk.
  3. Per How do I bulk insert two datatables that have an Identity relationship and Best practices for inserting/updating large amount of data in SQL Server 2008, import the tsv file into a staging table, then merge into the schema, using DB functions to do the preprocessing (a sketch of the load step is below). This seems like the best option, but I'd think the validation and preprocessing could be done more efficiently in C# or really anything else.

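To make option 3 concrete, the load step itself would be a single bulk operation into a wide staging table, roughly like the sketch below (the staging table name and file path are placeholders of mine, not part of the actual system):

```sql
-- One bulk load of the raw tsv into a wide staging table that mirrors the file.
-- dbo.StagingMealOrders is a placeholder name; the table must exist before the load.
BULK INSERT dbo.StagingMealOrders
FROM 'C:\imports\January_1_Lunch.tsv'
WITH (
    FIELDTERMINATOR = '\t',   -- tab-separated columns
    ROWTERMINATOR   = '\n',
    FIRSTROW        = 2       -- skip the header row
);
```
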
Are there any other possibilities out there?

The schema is still under development so I can revise it if that ends up being the sticking point.

ibadibam
  • Use SSIS. – Mitch Wheat Jan 03 '14 at 03:42
  • I would advise you reconsider the split between beverage, meal and dessert and think about having a single table for those items with a code to indicate a 'type' or 'category'. Why? Well consider a business change that adds 'starters' or 'side orders', it's much easier to add a new category than to add a whole new table. Don't worry unduly about space inefficiency; it's much more important to get a good model. – Rikalous Jan 03 '14 at 11:04
  • @MitchWheat - I had a feeling SQL Server Integration Services might be useful but know very little about it. Could you be a little more specific? Are we talking about ETL here? – ibadibam Jan 06 '14 at 18:29
  • @Rikalous - If we're talking about hundreds of millions of rows full of nulls, where do we draw the line determining what is unduly worrying and what is prudent? – ibadibam Jan 06 '14 at 18:33

1 Answer


You can import your file into a table with the following structure: Diner, Beverage, Food, Dessert, ID (identity, primary key NOT CLUSTERED for performance reasons).
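
A sketch of such a staging table, with illustrative names of my own choosing:

```sql
-- Staging table for the raw tsv rows. The identity primary key is declared
-- NONCLUSTERED so the bulk load is not forced to insert in key order.
CREATE TABLE ImportedMeals (
    Diner    varchar(100) NOT NULL,
    Beverage varchar(100) NULL,
    Food     varchar(400) NULL,   -- still the raw underscore-delimited list
    Dessert  varchar(100) NULL,
    ID       int IDENTITY(1,1) NOT NULL
        CONSTRAINT PK_ImportedMeals PRIMARY KEY NONCLUSTERED
);
```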

After that, simply add the following columns: Diner_ID, Beverage_ID, Dessert_ID, and fill them from your separate tables (it is straightforward to group each of these columns, add the missing values to the lookup tables such as Beverages, Desserts and Meals, and then update the imported table with the IDs of both existing and newly added records).
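
For the Beverage column, for example, that could look roughly like this, assuming the hypothetical ImportedMeals staging table sketched above and an identity id column on Beverages (Diner_ID and Dessert_ID follow the same pattern):

```sql
-- New lookup-ID column on the staging table.
ALTER TABLE ImportedMeals ADD Beverage_ID int NULL;
GO

-- Add any beverages that are not in the lookup table yet.
INSERT INTO Beverages (name)
SELECT DISTINCT i.Beverage
FROM ImportedMeals i
WHERE i.Beverage IS NOT NULL
  AND NOT EXISTS (SELECT 1 FROM Beverages b WHERE b.name = i.Beverage);

-- Back-fill the staging rows with the IDs of both existing and newly added records.
UPDATE i
SET    Beverage_ID = b.id
FROM   ImportedMeals i
JOIN   Beverages b ON b.name = i.Beverage;
```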

The Food column is more complex because a single value can combine several foods, but the same trick works: split each combination into its individual dishes, add any missing dishes to the Food lookup table, and keep the distinct combinations in an additional temp table (with a unique ID) together with their breakdown into single dishes.

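SQL Server 2008 has no built-in string-split function, so one way (of several) to do that separation is a small recursive CTE. A sketch against the hypothetical ImportedMeals table above, producing a combinations table and a per-dish mapping (the dish names can then be joined to the Food lookup table for IDs, exactly as for the other columns):

```sql
-- Temp table of distinct food combinations, each given a surrogate ID.
SELECT d.Food,
       ROW_NUMBER() OVER (ORDER BY d.Food) AS FoodList_ID
INTO   FoodLists
FROM   (SELECT DISTINCT Food FROM ImportedMeals WHERE Food IS NOT NULL) d;

-- Temp table that splits every combination into its single dishes.
;WITH Split AS (
    SELECT FoodList_ID,
           CAST(LEFT(Food, CHARINDEX('_', Food + '_') - 1) AS varchar(400))     AS Dish,
           CAST(STUFF(Food, 1, CHARINDEX('_', Food + '_'), '') AS varchar(400)) AS Rest
    FROM   FoodLists
    UNION ALL
    SELECT FoodList_ID,
           CAST(LEFT(Rest, CHARINDEX('_', Rest + '_') - 1) AS varchar(400)),
           CAST(STUFF(Rest, 1, CHARINDEX('_', Rest + '_'), '') AS varchar(400))
    FROM   Split
    WHERE  Rest <> ''
)
SELECT FoodList_ID, Dish
INTO   FoodListItems
FROM   Split;
```
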
When the parsing is finished, you will have three temp tables:

  1. a table with all your imported data and IDs for all the text columns
  2. a table of distinct food combinations (with IDs)
  3. a table with the food IDs belonging to each food combination

From these tables you can insert the parsed values into whatever target structure you need.
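
As an illustration, under the hypothetical sketches above, and assuming Meal_ID and Diner_ID have been back-filled into ImportedMeals the same way as Beverage_ID and the single dishes added to the Food lookup table, those final inserts might look like:

```sql
-- FoodOrders: fan each staging row out through its food combination.
INSERT INTO FoodOrders (mealId, dinerId, foodId)
SELECT i.Meal_ID, i.Diner_ID, f.id
FROM ImportedMeals i
JOIN FoodLists     fl ON fl.Food = i.Food
JOIN FoodListItems fi ON fi.FoodList_ID = fl.FoodList_ID
JOIN Food          f  ON f.name = fi.Dish;

-- BeverageOrders (and DessertOrders) are plain joins on the back-filled IDs.
INSERT INTO BeverageOrders (mealId, dinerId, beverageId)
SELECT i.Meal_ID, i.Diner_ID, i.Beverage_ID
FROM ImportedMeals i
WHERE i.Beverage_ID IS NOT NULL;
```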

In this case only one (bulk) insert is made to the DB from the code side; all other data manipulation is performed in the DB.

Sandr
  • With respect to the `Food` table, where would one perform the string processing to separate the single dishes? Can SQL Server handle this efficiently? – ibadibam Jan 06 '14 at 18:38
  • A simple parser can be written that receives a string and a delimiter and splits the string into a table on that delimiter. – Sandr Jan 07 '14 at 06:18
  • This was a while back now, but after a few attempts to do this with SSIS, we found that doing lookups for 100000+ rows didn't perform well. Bulk loading into staging tables, adding columns for the foreign keys, then inserting those rows into the destination tables was the way to go. – ibadibam Dec 04 '17 at 18:59