1

I'm looking for a way to return a dynamic column list from a LINQ join of two datatables.

First, this is not a duplicate. I have already studied and discarded:

C# LINQ list select columns dynamically from a joined dataset

Creating a LINQ select from multiple tables

How to do a LINQ join that behaves exactly like a physical database inner join?

(and many others)

Here is my starting point:

public static DataTable JoinDataTables(DataTable dt1, DataTable dt2, string table1KeyField, string table2KeyField, string[] columns) {
   DataTable result = ( from dataRows1 in dt1.AsEnumerable()
                        join dataRows2 in dt2.AsEnumerable()
                        on dataRows1.Field<string>(table1KeyField) equals dataRows2.Field<string>(table2KeyField)
                        [...I NEED HELP HERE with the SELECT....]).CopyToDataTable();
   return result;
}

A few notes and requirements:

  1. There is no database engine. The data sources are large CSV files (500K+ records) being read into c# DataTables.
  2. Because the CSVs are large, looping through each record in the join is a bad solution for performance reasons. I've already tried record looping and it's just too slow. I get great performance on the join above, but I can't find a way to have it return just the columns I want (specified by the caller) without looping records.
  3. If I need to loop over columns in the join, that is perfectly fine, I just don't want to loop rows.
  4. I want to be able to pass in an array of column names and return just those columns in the resulting DataTable. If both datatables being passed in happen to have a column named the same, and if that column is in my array of column names, just pass back either column because the data will be the same between the 2 columns in that case.
  5. If I need to pass in 2 arrays (1 for each datatable's desired columns) that's fine, but 1 array of column names would be ideal.
  6. The column list cannot be static and hardcoded into the function. The reason is because my JoinDataTables() is called from many different places in my system in order to join a wide variety of CSVs-turned-datatables, and each CSV file has very different columns.
  7. I don't want all columns returned in the resulting DataTable -- just the columns I specify in the columns array.

So suppose, before calling JoinDataTables(), I have the following 2 datatables:

Table: T1
T1A T1B T1C T1D
==================
10  AA  H1  Foo1
11  AB  H1  Foo2
12  AA  H2  Foo1
13  AB  H2  Foo2

Table: T2
T2A T2X T2Y T2Z
==================
12  N1  O1  Yeah1
17  N2  O2  Yeah2
18  N3  O1  Yeah1
19  N4  O2  Yeah2

Now suppose we join these 2 tables like so: ON T1.T1A = T2.T2A

select * from [join]

and that yields this resultset:

T1A T1B T1C T1D   T2A T2X T2Y T2Z
====================================
12  AA  H2  Foo1  12  N1  O1  Yeah1

Notice that only 1 row is yielded by the join.

Now to the crux of my question. Suppose that for a given use case, I want to return only 4 columns from this join: T1A, T1D, T2A, and T2Y. So my resultset would then look like this:

T1A T1D   T2A  T2Y
==================
12  Foo1  12   O1

I'd like to be able to call my JoinDataTables function like so:

DataTable dt = JoinDataTables(dt1, dt2, "T1A", "T2A", new string[] {"T1A", "T1D", "T2A", "T2Y"});

Keeping in mind performance and the fact that I don't want to loop through records (because it's slow for large sets of data), how can this be accomplished? (The join is already working well, now I just need a correct select segment (whether via new{..} or whatever you think)).

I cannot accept a solution with a hardcoded column list inside the function. I have found examples of that approach all over SO.

Any ideas?

EDIT: I'd be ok getting ALL columns back every time, but every attempt I've made to include all columns has resulted in some kind of FULL OUTER JOIN or CROSS JOIN, returning orders of magnitude more records than it should. So, I'd be open to getting all columns back, as long as I don't get the cross join.

HerrimanCoder
  • 6,835
  • 24
  • 78
  • 158
  • *"I don't want all columns returned in the resulting DataTable"* -- **Why don't you?** If I did have a valid reason to exclude columns (I'm not saying you don't), I'd select everything in the join, then write code to grab just the columns I want from result and stuff them into a new DataTable that had just the required rows. I'd look at programmatically generating a LINQ expression to select the columns I wanted, but there's a good chance I'd end up just writing a bunch of regular C# code to create datarows. If you're not comfortable with the first option, the second is dull but routine. – 15ee8f99-57ff-4f92-890c-b56153 Jan 24 '19 at 22:31
  • Ed, I should have mentioned it, but every attempt I've made to include ALL columns has resulted in some kind of FULL OUTER JOIN or CROSS JOIN, returning orders of magnitude more records than it should. So, I'd be open to getting all columns back, as long as I don't get the cross join. – HerrimanCoder Jan 24 '19 at 22:57
  • That puts a very different complexion on it. Can you do any join at all without getting a cross join? I think that’s the real issue here and I think it’s unrelated to selecting columns. I’d look hard at what’s in the columns you’re joining on. Have you tried this with small controlled test data files? – 15ee8f99-57ff-4f92-890c-b56153 Jan 25 '19 at 00:00
  • Yes, I can get a valid inner join if I return ONLY the columns from DT1. But that's not sufficient. For example, if I do this: `select dataRows1).Distinct().CopyToDataTable();` -- all is right and good, except I can't find a way to also return columns from DT2 without iterating every record and building a new DT from the ground up. There MUST be a simpler, faster, and more elegant way than that. – HerrimanCoder Jan 25 '19 at 00:13
  • Good luck with your problem. – 15ee8f99-57ff-4f92-890c-b56153 Jan 25 '19 at 00:44
  • BTW, I think you are under a misapprehension about looping through the rows: the `CopyToDataTable` effectively has to loop through the rows, and using `Distinct` would definitely loop through the rows, so it is more a matter of minimizing the work when looping through the rows. – NetMage Jan 25 '19 at 01:15

1 Answers1

0

I'm not sure of the performance with 500k records, but here is an attempted solution.

Since you are combining two subsets of DataRows from different tables, there are no easy operations that will create the subset or create a new DataTable from the subsets (though I have an extension method for flattening an IEnumerable<anon> where anon = new { DataRow1, DataRow2, ... } from a join, it would probably be slow for you).

Instead, I pre-create an answer DataTable with the columns requested and then use LINQ to build the value arrays to be added as the rows.

public static DataTable JoinDataTables(DataTable dt1, DataTable dt2, string table1KeyField, string table2KeyField, string[] columns) {
    var rtnCols1 = dt1.Columns.Cast<DataColumn>().Where(dc => columns.Contains(dc.ColumnName)).ToList();
    var rc1 = rtnCols1.Select(dc => dc.ColumnName).ToList();
    var rtnCols2 = dt2.Columns.Cast<DataColumn>().Where(dc => columns.Contains(dc.ColumnName) && !rc1.Contains(dc.ColumnName)).ToList();
    var rc2 = rtnCols2.Select(dc => dc.ColumnName).ToList();

    var work = from dataRows1 in dt1.AsEnumerable()
               join dataRows2 in dt2.AsEnumerable()
               on dataRows1.Field<string>(table1KeyField) equals dataRows2.Field<string>(table2KeyField)
               select (from c1 in rc1 select dataRows1[c1]).Concat(from c2 in rc2 select dataRows2[c2]).ToArray();

    var result = new DataTable();
    foreach (var rc in rtnCols1)
        result.Columns.Add(rc.ColumnName, rc.DataType);
    foreach (var rc in rtnCols2)
        result.Columns.Add(rc.ColumnName, rc.DataType);

    foreach (var rowVals in work)
        result.Rows.Add(rowVals);

    return result;
}

Since you were using query syntax, I did as well, but normally I would probably do the select like so:

select rc1.Select(c1 => dataRows1[c1]).Concat(rc2.Select(c2 => dataRows2[c2])).ToArray();

Updated: It is probably worthwhile to use the column ordinals instead of the names to index into each DataRow by replacing the definitions of rc1 and rc2:

var rc1 = rtnCols1.Select(dc => dc.Ordinal).ToList();
var rc1Names = rtnCols1.Select(dc => dc.ColumnName).ToHashSet();
var rtnCols2 = dt2.Columns.Cast<DataColumn>().Where(dc => columns.Contains(dc.ColumnName) && !rc1Names.Contains(dc.ColumnName)).ToList();
var rc2 = rtnCols2.Select(dc => dc.Ordinal).ToList();
NetMage
  • 26,163
  • 3
  • 34
  • 55
  • Thank you NetMage - I'm leaving on vacation today, in a few hours, so I don't know if I can try this out today or 9 days from now when I'm back, but I'll certainly report back on results at some point. Thanks for your patience. – HerrimanCoder Jan 25 '19 at 13:10
  • @HerrimanCoder Gives me time to try some other optimizations. Can you give me a range of percent of join versus non-join rows in the two tables? – NetMage Jan 25 '19 at 19:40
  • I'm not sure I have that info but I'll see if I can find out. – HerrimanCoder Feb 04 '19 at 18:29
  • NetMage, I finally was able to try this out against some datasets, and it gives me an OutOfMemory exception. – HerrimanCoder Feb 04 '19 at 21:21
  • @HerrimanCoder How large are your datasets - I created some simple 500k datasets without issue using 64-bit (.Net target Any CPU) C# in LINQPad. – NetMage Feb 04 '19 at 21:31
  • Less than 600K. And my PC has 32 Gb of RAM and lots of CPU. – HerrimanCoder Feb 04 '19 at 23:52
  • @HerrimanCoder Is that 600K records? I am running 2 million rows, but how large are the rows (bytes)? How large is the data in a final answer row? – NetMage Feb 05 '19 at 00:38
  • Sorry, I need to clarify. The first join explodes the recordset to 15M records - something is wrong - I've run into this same problem repeatedly when trying to get columns from both tables. So then I joined a 2nd time (against the 15M), and that's when OutOfMemory happened, not surprisingly. The true problem is the exploding record count. Why is that happening? When I just get DataTable #1's columns back, I get a reasonable number of columns. – HerrimanCoder Feb 05 '19 at 22:23
  • @herrimCoder if you group by the keys and count the rows, how many rows per key do you have in the first join on each side? What does that look like (is it one per row or more?). – NetMage Feb 05 '19 at 22:37
  • Sorry for lateness NetMage. I know how to GROUP BY in sql, but how would I do it in linq within the context of my current code? – HerrimanCoder Feb 14 '19 at 18:30
  • My question was for you to count/determine from your data the answers (not solve the issue yet) and that could help point to why you are running out of memory. – NetMage Feb 14 '19 at 19:38