2

I am working on a method that takes in two datatables and a list of primary key column names and gives back the matches. I do not have any other info about the tables.

I have searched the site for a solution to this problem and have found some answers, but none have given me a fast enough solution.

Based on results from stackoverflow I now have this:

var matches =
    (from rowA in tableA.AsEnumerable()
     from rowB in tableB.AsEnumerable()
     where primaryKeyColumnNames.All(column => rowA[column].ToString() == rowB[column].ToString())
     select new { rowA, rowB });

The problem is this is REALLY slow. It takes 4 minutes for two tables of 8000 rows each. Before I came to stackoverflow I was actually iterating through the columns and rows it took 2 minutes. (so this is actually slower than what I had) 2-4 minutes doesn't seem so bad until I hit the table with 350,000 rows. It takes days. I need to find a better solution.

Can anyone think of a way for this be faster?

Edit: Per a suggestion from tinstaafl this is now my code.

var matches = tableA.Rows.Cast<DataRow>().Select(rowA => new 
{
    rowA,
    rowB = tableB.Rows.Find(rowA.ItemArray.Where((x, y) => 
        primaryKeyColumnNames.Contains(tableA.Columns[y].ColumnName,
               StringComparer.InvariantCultureIgnoreCase)).ToArray())
})
.Where(x => x.rowB != null);    
dspiegs
  • 548
  • 2
  • 9
  • 24

1 Answers1

3

Using the PrimaryKey property of the DataTable, which will accept an array of columns, should help. Perhaps something like this:

tableA.PrimaryKey = primaryKeyColumnNames.Select(x => tableA.Columns[x]).ToArray();
tableB.PrimaryKey = primaryKeyColumnNames.Select(x => tableB.Columns[x]).ToArray();
var matches = (from System.Data.DataRow RowA in tableA.Rows
                where tableB.Rows.Contains(RowA.ItemArray.Where((x,y) => primaryKeyColumnNames.Contains(tableA.Columns[y].ColumnName)).ToArray())
                select RowA).ToList();

In a test with 2 tables with 9900 rows and returning 9800 as common, this took about 1/3 of a second.

tinstaafl
  • 6,908
  • 2
  • 15
  • 22
  • Thank you! I will try this now. I was going down the rabbit hole, about to try either hashing the pks, or rewriting the application in F# – dspiegs Jun 04 '14 at 17:10
  • I modified your solution and I am about to start testing. You can see my change above. If it works I'll select your answer! – dspiegs Jun 04 '14 at 20:03
  • 1
    The code I submitted will work regardless of the order of the columns, since it relies on the column name. – tinstaafl Jun 04 '14 at 21:28
  • @tinstaafl... The "y" in the lambda is the column index, not name. When I changed the order of the columns in the second table the code failed. I used this to fix: http://stackoverflow.com/questions/3757997/how-to-change-datatable-colums-order – dspiegs Jun 04 '14 at 21:31
  • 1
    The y is only used to get the column name from the itemarray index. The column order must match the order of itemarray, in the datarow of that datatable. The comparison is done by columnname. If the column order between the 2 tables don't match the code might need adjusting. – tinstaafl Jun 04 '14 at 21:37
  • I was able to fix it by forcing the column order of the primary keys to be the same in both tables. Thank you for your help. The programs is much faster now. DataTables must hash primary keys – dspiegs Jun 04 '14 at 21:53