I'm looking for a way to return a dynamic column list from a LINQ join of two datatables.
First, this is not a duplicate. I have already studied and discarded:
C# LINQ list select columns dynamically from a joined dataset
Creating a LINQ select from multiple tables
How to do a LINQ join that behaves exactly like a physical database inner join?
(and many others)
Here is my starting point:
public static DataTable JoinDataTables(DataTable dt1, DataTable dt2, string table1KeyField, string table2KeyField, string[] columns) {
DataTable result = ( from dataRows1 in dt1.AsEnumerable()
join dataRows2 in dt2.AsEnumerable()
on dataRows1.Field<string>(table1KeyField) equals dataRows2.Field<string>(table2KeyField)
[...I NEED HELP HERE with the SELECT....]).CopyToDataTable();
return result;
}
A few notes and requirements:
- There is no database engine. The data sources are large CSV files (500K+ records) being read into c#
DataTable
s. - Because the CSVs are large, looping through each record in the join is a bad solution for performance reasons. I've already tried record looping and it's just too slow. I get great performance on the join above, but I can't find a way to have it return just the columns I want (specified by the caller) without looping records.
- If I need to loop over columns in the join, that is perfectly fine, I just don't want to loop rows.
- I want to be able to pass in an array of column names and return just those columns in the resulting
DataTable
. If both datatables being passed in happen to have a column named the same, and if that column is in my array of column names, just pass back either column because the data will be the same between the 2 columns in that case. - If I need to pass in 2 arrays (1 for each datatable's desired columns) that's fine, but 1 array of column names would be ideal.
- The column list cannot be static and hardcoded into the function. The reason is because my
JoinDataTables()
is called from many different places in my system in order to join a wide variety of CSVs-turned-datatables, and each CSV file has very different columns. - I don't want all columns returned in the resulting
DataTable
-- just the columns I specify in thecolumns
array.
So suppose, before calling JoinDataTables()
, I have the following 2 datatables:
Table: T1
T1A T1B T1C T1D
==================
10 AA H1 Foo1
11 AB H1 Foo2
12 AA H2 Foo1
13 AB H2 Foo2
Table: T2
T2A T2X T2Y T2Z
==================
12 N1 O1 Yeah1
17 N2 O2 Yeah2
18 N3 O1 Yeah1
19 N4 O2 Yeah2
Now suppose we join these 2 tables like so:
ON T1.T1A = T2.T2A
select * from [join]
and that yields this resultset:
T1A T1B T1C T1D T2A T2X T2Y T2Z
====================================
12 AA H2 Foo1 12 N1 O1 Yeah1
Notice that only 1 row is yielded by the join.
Now to the crux of my question. Suppose that for a given use case, I want to return only 4 columns from this join: T1A, T1D, T2A, and T2Y. So my resultset would then look like this:
T1A T1D T2A T2Y
==================
12 Foo1 12 O1
I'd like to be able to call my JoinDataTables
function like so:
DataTable dt = JoinDataTables(dt1, dt2, "T1A", "T2A", new string[] {"T1A", "T1D", "T2A", "T2Y"});
Keeping in mind performance and the fact that I don't want to loop through records (because it's slow for large sets of data), how can this be accomplished? (The join is already working well, now I just need a correct select
segment (whether via new{..}
or whatever you think)).
I cannot accept a solution with a hardcoded column list inside the function. I have found examples of that approach all over SO
.
Any ideas?
EDIT: I'd be ok getting ALL columns back every time, but every attempt I've made to include all columns has resulted in some kind of FULL OUTER JOIN or CROSS JOIN, returning orders of magnitude more records than it should. So, I'd be open to getting all columns back, as long as I don't get the cross join.