0

Thanks for reading this.

Here's my goal: I have two Datatables I've read in from Excel Worksheets. The Datatables have the same schema (Columns A, B, C, ... in Datatable1 has same kind of data as Column A, B, C, ... in Datatable2).

I need to compare data in the tables by arbitrary columns (i.e. for comparison only Column A and C matter, but I need to keep data in Column A, B, C, ..., N).

Since I'm reading these in from Excel Worksheets, the schema can't be expected. For example, if I load a different set of Worksheets, the comparison columns may differ. For this reason, I can't use LINQ, which is like a hard coded SQL statement.

I need to perform the equivalent of a FULL OUTER JOIN. I'm trying to show all data, including missing data from either datatable which doesn't appear in the other datatable.

I've read a little bit about DataRelations, but I'm not sure how to use them.

Please provide example code.

Thanks in advance!

JessStuart
  • 1
  • 1
  • 2
  • Is this really a full join or a merge? – Corey Jul 23 '14 at 06:17
  • *which is like a hard coded SQL statement.* <--- And what's a string literal containing a SQL statement...? – Matías Fidemraizer Jul 23 '14 at 06:36
  • @Corey: I need a full outer join (showing nulls for missing data from the two worksheets). – JessStuart Jul 24 '14 at 05:13
  • @MatíasFidemraizer: The LINQ statements are made of C# code, and can't be modified programmatically. A String SQL statement can be built to match whatever worksheet schema encountered. – JessStuart Jul 24 '14 at 05:15
  • @JessStuart No comments – Matías Fidemraizer Jul 24 '14 at 06:10
  • @JessStuart So you want the output to have twice the columns? This is simple enough... but seldom useful. Not hard to do with Linq either, regardless of the column counts. If this is what you want I'll give an answer. – Corey Jul 24 '14 at 07:16
  • @Corey That's what I want. The point is to record a "Before" set of data, and compare it to an "After" set of data from the same source (but after a software update). **Here's the problem, though.** This has to work between _C# DataTables_ (if this was in a database, it would be simple to set up the SQL). I don't know how to perform a full outer join between arbitrary sets of data. I doubt LINQ can be used when the datatable schema is unknown. – JessStuart Jul 25 '14 at 03:10

1 Answers1

2

Given a pair of DataTables with an arbitrary number of columns, and given a function that can create a grouping value of a reasonable type from each of the two DataTables, you can use Linq to do most of the work for you.

Let's start with a function to extract the join key from the DataTables. It'd be nice to just return an object[], but they don't compare well. We can do it with a Tuple<object, object> however - those work nicely for this purpose. And if you need more columns you can just add more columns :P

// Produces a JoinKey as Tuple containing columns 'A' and 'C' (0 and 2)
public Tuple<object, object> GetJoinKey(DataRow row)
{
    return Tuple.Create(row[0], row[2]);
}

Now the join. We can't do a full outer join directly, but we can do an outer join both ways and Union the results:

// given DataTables table1 & table2:
var outerjoin = 
    (
        from row1 in table1.AsEnumerable()
        join row2 in table2.AsEnumerable() 
            on GetJoinKey(row1) equals GetJoinKey(row2)
            into matches
        from row2 in matches.DefaultIfEmpty()
        select new { key = GetJoinKey(row1), row1, row2 }
    ).Union(
        from row2 in table2.AsEnumerable()
        join row1 in table1.AsEnumerable()
            on GetJoinKey(row2) equals GetJoinKey(row1)
            into matches
        from row1 in matches.DefaultIfEmpty()
        select new { key = GetJoinKey(row2), row1, row2 }
    );

Next up you have to create a suitable output format - a DataTable that has all the rows from both sources, plus a field to hold some info about the key:

DataTable result = new DataTable();
// add column for string value of key:
result.Columns.Add("__key", typeof(string));
// add columns from table1:
foreach (var col in table1.Columns.OfType<DataColumn>())
    result.Columns.Add("T1_" + col.ColumnName, col.DataType);
// add columns from table2:
foreach (var col in table2.Columns.OfType<DataColumn>())
    result.Columns.Add("T2_" + col.ColumnName, col.DataType);

And finally, fill the table from the query:

var row1def = new object[table1.Columns.Count];
var row2def = new object[table2.Columns.Count];
foreach (var src in outerjoin)
{
    // extract values from each row where present
    var data1 = (src.row1 == null ? row1def : src.row1.ItemArray);
    var data2 = (src.row2 == null ? row2def : src.row2.ItemArray);

    // create row with key string and row values
    result.Rows.Add(new object[] { src.key.ToString() }.Concat(data1).Concat(data2).ToArray());
}

Of course we could short out a couple of those operations to get a single Linq query that does 99% of the work for us. I'll leave that to you to play with if it sounds like fun.

Here's the full method, done as an extension with a generic function for the join key generator, making it reasonably generic:

public static DataTable FullOuterJoin<T>(this DataTable table1, DataTable table2, Func<DataRow, T> keygen)
{
    // perform inital outer join operation
    var outerjoin = 
        (
            from row1 in table1.AsEnumerable()
            join row2 in table2.AsEnumerable() 
                on keygen(row1) equals keygen(row2)
                into matches
            from row2 in matches.DefaultIfEmpty()
            select new { key = keygen(row1), row1, row2 }
        ).Union(
            from row2 in table2.AsEnumerable()
            join row1 in table1.AsEnumerable()
                on keygen(row2) equals keygen(row1)
                into matches
            from row1 in matches.DefaultIfEmpty()
            select new { key = keygen(row2), row1, row2 }
        );

    // Create result table
    DataTable result = new DataTable();
    result.Columns.Add("__key", typeof(string));
    foreach (var col in table1.Columns.OfType<DataColumn>())
        result.Columns.Add("T1_" + col.ColumnName, col.DataType);
    foreach (var col in table2.Columns.OfType<DataColumn>())
        result.Columns.Add("T2_" + col.ColumnName, col.DataType);

    // fill table from join query
    var row1def = new object[table1.Columns.Count];
    var row2def = new object[table2.Columns.Count];
    foreach (var src in outerjoin)
    {
        // extract values from each row where present
        var data1 = (src.row1 == null ? row1def : src.row1.ItemArray);
        var data2 = (src.row2 == null ? row2def : src.row2.ItemArray);

        // create row with key string and row values
        result.Rows.Add(new object[] { src.key.ToString() }.Concat(data1).Concat(data2).ToArray());
    }

    return result;
}

Now, IF the tables have the same schema (which the above doesn't care about), you can do almost exactly the same thing - modify the result table generation to just clone one of the tables, then add some merge logic in the load loop.

Here's a Gist of the above with testing and verification that it's doing what it says. Drop that in your compiler and see what you get out.

Corey
  • 15,524
  • 2
  • 35
  • 68
  • This is a awesome! I wish I had enough reputation to vote up! – JessStuart Jul 25 '14 at 19:47
  • Does this have to be a static class? What is the advantage of using a static class in this case? – JessStuart Aug 27 '14 at 14:06
  • I made it a static method both because it uses no instance data and so that (when you put it in a static class) it can work as an extension method on `DataTable`... so you can use `table1.FullOuterJoin(table2, keygen);` instead of `FullOuterJoin(table1, table2, keygen);` – Corey Aug 28 '14 at 00:27
  • This works great for reproducing the data in a full out join, but it doesn't preserve the original content order. To get the order, I had to cycle through each table, compare on key values, and copy rows from each table to the result as based on the comparison. I'm very grateful for your example @corey, as I hadn't ever used tuples before. – JessStuart Sep 07 '14 at 00:34
  • Tuples are handy things, so I'm glad I could introduce you to them. – Corey Sep 07 '14 at 22:16