The data source is a collection of CSV files so there is no actual database. This is an integration with a decades-old legacy system based in Japan.
I have a c# function that needs to take 2 DataTables
and 2 column names as params. My function needs to do the equivalent of an INNER JOIN
on these 2 datatables, and then return all columns from the first table, and only the "joined column" from the second table.
The schemas (read: columns) of these datatables won't be known until runtime, so the function cannot have any hardcoded column names. My function at last needs to return a new DataTable with the inner-joined data, and a DISTINCTed resultset based on the select list as just specified.
Here is my [modified] attempt, which seems to produce a promising resultset:
public static DataTable JoinDataTables2(DataTable dt1, DataTable dt2, string table1KeyField, string table2KeyField) {
DataTable result = ( from dataRows1 in dt1.AsEnumerable()
join dataRows2 in dt2.AsEnumerable()
on dataRows1.Field<string>(table1KeyField) equals dataRows2.Field<string>(table2KeyField)
select dataRows1).CopyToDataTable();
return result;
}
I call it like this:
Common.JoinDataTables2(dtCSV, _dtModelOptions, "CMODEL", "ModelID");
My objective is to perform an inner join like in a physical database, with a distincted resultset based on the resultset specified above. You may wonder why I'm not simply doing the join in the database. It's because there is no database; the data comes from CSV files generated from a third-party system.
So I have 3 remaining problems:
- I'm unsure whether the resultset I'm getting back is correct based on INNER JOIN behavior.
- The select list doesn't include the "join column" for the 2nd datatable (in this particular example, that would be "ModelID"), and I need it to. Once it does that, I can confirm that the CMODEL values match the ModelID values, and thus confirm that I have a valid join. (That's just 1 scenario, it will be different, so no column names can be hardcoded in the function.)
- How to DISTINCT the resultset?
Here's one concrete example from my system, but again, the datatables and schemas will all be different:
dtCSV columns:
- CMODEL
- CATT_CD
- NSTAND
- CAPPLY1
- CAPPLY2
- DREFIX_D
_dtModelOptions columns:
- SeriesID
- ModelID
- OptionID
What changes do I need for my function so that:
- It does an INNER JOIN and a DISTINCT (is it doing this already?)
- It selects all columns from the first table and only the "join-column" from the second table(currently it only gets the first table's columns)
- Performance is as fast as possible (I was previously
foreach
ing through records to achieve the join, and that approach was terribly slow.)