2

The data source is a collection of CSV files so there is no actual database. This is an integration with a decades-old legacy system based in Japan.

I have a c# function that needs to take 2 DataTables and 2 column names as params. My function needs to do the equivalent of an INNER JOIN on these 2 datatables, and then return all columns from the first table, and only the "joined column" from the second table.

The schemas (read: columns) of these datatables won't be known until runtime, so the function cannot have any hardcoded column names. My function at last needs to return a new DataTable with the inner-joined data, and a DISTINCTed resultset based on the select list as just specified.

Here is my [modified] attempt, which seems to produce a promising resultset:

public static DataTable JoinDataTables2(DataTable dt1, DataTable dt2, string table1KeyField, string table2KeyField) {
   DataTable result = ( from dataRows1 in dt1.AsEnumerable()
                        join dataRows2 in dt2.AsEnumerable()
                        on dataRows1.Field<string>(table1KeyField) equals dataRows2.Field<string>(table2KeyField)
                        select dataRows1).CopyToDataTable();
   return result;
}

I call it like this:

Common.JoinDataTables2(dtCSV, _dtModelOptions, "CMODEL", "ModelID");

My objective is to perform an inner join like in a physical database, with a distincted resultset based on the resultset specified above. You may wonder why I'm not simply doing the join in the database. It's because there is no database; the data comes from CSV files generated from a third-party system.

So I have 3 remaining problems:

  1. I'm unsure whether the resultset I'm getting back is correct based on INNER JOIN behavior.
  2. The select list doesn't include the "join column" for the 2nd datatable (in this particular example, that would be "ModelID"), and I need it to. Once it does that, I can confirm that the CMODEL values match the ModelID values, and thus confirm that I have a valid join. (That's just 1 scenario, it will be different, so no column names can be hardcoded in the function.)
  3. How to DISTINCT the resultset?

Here's one concrete example from my system, but again, the datatables and schemas will all be different:

dtCSV columns:

  1. CMODEL
  2. CATT_CD
  3. NSTAND
  4. CAPPLY1
  5. CAPPLY2
  6. DREFIX_D

_dtModelOptions columns:

  1. SeriesID
  2. ModelID
  3. OptionID

What changes do I need for my function so that:

  1. It does an INNER JOIN and a DISTINCT (is it doing this already?)
  2. It selects all columns from the first table and only the "join-column" from the second table(currently it only gets the first table's columns)
  3. Performance is as fast as possible (I was previously foreaching through records to achieve the join, and that approach was terribly slow.)
halfer
  • 19,824
  • 17
  • 99
  • 186
HerrimanCoder
  • 6,835
  • 24
  • 78
  • 158
  • Try removing `from r in lj.DefaultIfEmpty()` – Chetan Jan 08 '19 at 00:56
  • To write query for inner join with a condition you need to make two anonymous types (one for left table and one for right table) by using new keyword and compare both the anonymous types – jazb Jan 08 '19 at 00:56
  • @ChetanRanpariya that is very promising - I'm getting many fewer results now - but how do I select all columns from the first table and only the "joined column" from the 2nd table? (I don't know column names until run time) – HerrimanCoder Jan 08 '19 at 01:05
  • 2
    What exactly is the goal of this method? You're joining two separate and distinct tables on a common field, but selecting rows from the first with no filtering. You're essentially creating a cross product of the tables, but discarding the items of the second table. Your results are only going to contain rows from the first table duplicated. – Jeff Mercado Jan 08 '19 at 01:21
  • 1
    I'm not sure what benefit you get from doing this over just using linq on the actual tables. You seem to be making things more complex for less benefit. Is there a reason I'm missing to create this method? – Ben Jan 08 '19 at 02:16
  • I just want to join 2 datatables at runtime (unknown at design time), and select all columns from the first table and just the "joined column" from the 2nd table. Just like an inner join against an actual database. – HerrimanCoder Jan 08 '19 at 16:00
  • There are no actual tables. The data sources are CSV files that get loaded into c# DataTables. – HerrimanCoder Jan 17 '19 at 00:36
  • Sorry, but linq query will be always slower than sql query, due to the way they access to data. Do you want to know how to achieve that? – Maciej Los Jan 17 '19 at 21:02

4 Answers4

1

Earlier Soution ...

public static DataTable JoinDataTables2(DataTable dt1, DataTable dt2, string table1KeyField, string table2KeyField) {
   DataTable result = ( from dataRows1 in dt1.AsEnumerable()
                            join dataRows2 in dt2.AsEnumerable()
                            on dataRows1.Field<string>(table1KeyField) equals dataRows2.Field<string>(table2KeyField) 
                            select new {Col1= datarows1Field<string>(table1FieldName), Col2= datarows2.Field<string>(table2FieldName)}).Distinct().CopyToDataTable();
   return result;
}

You can list all the columns from table1 in select query. Following query has per-defined DataTable with all columns from table1 and just key column from table2. It may help you.

public static DataTable JoinDataTables2(DataTable dt1, DataTable dt2, string table1KeyField, string table2KeyField)
{
    DataTable joinTable = new DataTable();
    foreach (DataColumn dt1Column in dt1.Columns)
    {
        joinTable.Columns.Add(dt1Column.ColumnName, dt1Column.DataType);
    }

    var col2 = dt2.Columns[table2KeyField];
    joinTable.Columns.Add(col2.ColumnName,typeof(string));

    var result = (from dataRows1 in dt1.AsEnumerable()
                  join dataRows2 in dt2.AsEnumerable()
                      on dataRows1.Field<string>(table1KeyField) equals dataRows2.Field<string>(table2KeyField)
                  select new
                  {
                      Col1 = dataRows1,
                      Col2 = dataRows2.Field<string>(table2KeyField)
                  });
    foreach (var row in result)
    {
        DataRow dr = joinTable.NewRow();
        foreach (DataColumn dt1Column in dt1.Columns)
        {
            dr[dt1Column.ColumnName] = row.Col1[dt1Column.ColumnName];
        }

        dr[table2KeyField] = row.Col2;
        joinTable.Rows.Add(dr);
    }
    joinTable.AcceptChanges();
    return joinTable.AsEnumerable().Distinct().CopyToDataTable();
}
Sria Pathre
  • 182
  • 1
  • 10
  • Sria, what I'm looking for, and as mentioned in my original post, is that I need to return ALL columns from the first datatable, and only the "join column" of the 2nd datatable. Your solution returns only 2 columns - 1 from each table. Can you modify your solution? – HerrimanCoder Jan 11 '19 at 22:57
  • @HerrimanCoder please check updated version. It may work for you. To avoid listing all the columns, I'm adding columns from table1 in loop. – Sria Pathre Jan 14 '19 at 21:24
  • 2 problems: (1) The foreach() makes it very slow - I was doing something similar before and wanted to get away from the slowness; (2) the join is doing some kind of cross join and yielding more than 1M records instead of 100K records as expected. – HerrimanCoder Jan 16 '19 at 22:41
  • It's not doing cross join definitely but if you are working on 100k records, I can suggest you to use database join and not linq. – Sria Pathre Jan 17 '19 at 14:26
  • As mentioned in post and comments, there is no database. The data source is CSV files from a legacy customer system. If I could join directly on the database I wouldn't have needed to post this question. – HerrimanCoder Jan 18 '19 at 14:18
  • Oh yes. Sorry missed that. Did you find your solution or are you still trying to optimise it? – Sria Pathre Jan 18 '19 at 16:20
  • Still waiting for a correct and performant solution - no foreach. – HerrimanCoder Jan 20 '19 at 04:56
  • this is the fastest way I have found so far. Tried 1 million records in table1 and half million in table2. It generates join records in 3 secs. – Sria Pathre Jan 23 '19 at 15:51
  • 1
    Sorry for long delay, I was finally able to test and verify this against real data - works great!! Thank you. – HerrimanCoder Mar 07 '19 at 00:45
0

[Update #3]

  1. I'm unsure whether the resultset I'm getting back is correct based on INNER JOIN behavior.

A resultset returned by linq query represents exactly what you've written in a query.

  1. The select list doesn't include the "join column" for the 2nd datatable (in > this particular example, that would be "ModelID"), and I need it to.

The answer is very simple: your query returns data only from first datatable (BTW: you already mentioned it in the issue description).

Once it does that, I can confirm that the CMODEL values match the ModelID values, and thus confirm that I have a valid join. (That's just 1 scenario, it will be different, so no column names can be hardcoded in the function.)

You can be sure that a Linq2DataSet query returns correct ID's. They must match to be able to join them. If there's no match, a resultset will be empty! Seems, you have to improve your knowledge about joins. Please, read this excellent article: Visual Representation of SQL Joins

A short version of related article:

left join

Set1 = [1, 2, 3, 5]
Set2 = [2, 4, 5]
Resultset = [1,2,5] //get [1] from left (set1), [2,5] are common items (set1 and set2)

inner join

Set1 = [1, 2, 3, 5]
Set2 = [2, 4, 5]
Resultset = [2,5] //only common items (set1 and set2)

right join

Set1 = [1, 2, 3, 5]
Set2 = [2, 4, 5]
Resultset = [2,4,5] // gets [2] from right (set2), [4,5] are common (set1 and set2)

cross join

cross join returns the cartesian product of the sets
  1. How to DISTINCT the resultset?

There's a Distinct method.

But i'm not sure, you really need this ;(

General note:

There's few ways to read delimited files (*.csv):

1) using "standard" read text file methods and spliting text into parts in a [for] loop(s)

See: A Fast CSV Reader

2) using linq methods, i.e: Select()

Note: Most of programmers well know that linq methods will be much slower than [for] loop when working on big datasets.
To be able to project fields from joined tables, you have to use:

select new {datarows1, datarows2}

If you would like to create dynamic columns using Linq, see: Query datatable with dynamic column names using LINQ


Here is a complete code how to join two datatables into single datatable: dotnetfiddle


3) using OleDb: OleDbConnection, OleDbCommand

See:
Using OleDb To Import Text Files tab CSV Custom
Read Text File Specific Columns

Your extension method might look like:

public static DataTable OleDbJoin(string csv1, string csv2, string key1, string key2)
{
    DataTable dt = new DataTable();

    string sConn = string.Format(@"Provider=Microsoft.Jet.OLEDB.4.0;Data Source={0}\;Extended Properties='text;HDR=No;FMT=CSVDelimited()';", Path.GetDirectoryName(csv1));
    string sSql = string.Format(@"SELECT T.*
        FROM (
            SELECT * FROM [{0}] AS t1
            INNER JOIN (SELECT * FROM [{1}]) AS t2
                ON t1.[{2}] = t2.[{3}]) AS T;",
            Path.GetFileName(csv1), Path.GetFileName(csv2), key1, key2);

    try
    {
        using (OleDbConnection oConn = new OleDbConnection(sConn))
        {
            using (OleDbCommand oComm = new OleDbCommand(sSql, oConn))
            {
                oConn.Open();
                OleDbDataReader oRdr = oComm.ExecuteReader();
                dt.Load(oRdr);
                oComm.Dispose();
                oRdr.Dispose();
                oConn.Close();
                oConn.Dispose();
            }
        }
    }
    catch(OleDbException ex)
    {
        Console.WriteLine(ex.Message);
    }
    catch(Exception ex)
    {
        Console.WriteLine(ex.Message);
    }

    return dt;
}

Call:

DataTable resultDt = OleDbJoin("FullFileName1", "FullFileName2", "F1", "F2");

Requirements:
- both csv files have to be in the same directory
- csv files using standard separator for csv files, ee: Schema.ini file
- there's no headers in files (no column names)

Maciej Los
  • 8,468
  • 1
  • 20
  • 35
  • Maciej, thanks for your very thorough answer. One of the most important pieces remains unanswered: how to retrieve columns from the 2nd datatable. I believe there must be a way, because it can be done declaratively at code time; there must be some way to do it at runtime, with schema unknown until runtime. If you can answer that I'll award you the answer. Example, I want to be able to pass in an array of column names into the function, for the column values I want as part of the joined datatable. example: `{"SalesID", "FirstName", "SomeCode"}` - etc. – HerrimanCoder Jan 21 '19 at 19:24
  • The link in your [Update] seems promising, but how will it work inside my posted example? Please post my example with the dynamic columns. Thanks. – HerrimanCoder Jan 22 '19 at 01:41
  • @HerrimanCoder, I've updated my answer again. Please find a link to dotnetfiddle. There you'll find a complete code how to join two tables into single datatable. Please, check a `OleDbJoin` method too. It's very interesting. – Maciej Los Jan 22 '19 at 20:50
0

There is a bit of ambiguity, but from what I understand you need to Join two tables and get a rows from both (or less) of them after applying a Distinct() on the results of the Join. All this, given that the columns are not predefined.

Here is my solution:

  1. Add a Result class to wrap the outcome of your Join

    public class Result
    {
        public DataRow Table1Row { get; set; }
        public DataRow Table2Row { get; set; }
    
        public string DistictFieldValue { get; set; }
    }
    
  2. Add a ResultComparer class to help you with your own logic to get Distinct() results

    public class ResultComparer : IEqualityComparer<Result>
    {
        public bool Equals(Result x, Result y)
        {
            // Your logic to get distinct elements
            return x.DistictFieldValue == y.DistictFieldValue;
        }
    
        public int GetHashCode(Result obj)
        {
            return 0; // To enforce the Equals() gets callled.
        }
    }
    
  3. Update your method to use the above classes

    public static DataTable JoinDataTables2(DataTable dt1, DataTable dt2, string table1KeyField, string table2KeyField)
    {
        // Join with final selection containing rows from both the tablles
        var query = from dataRows1 in dt1.AsEnumerable()
                    join dataRows2 in dt2.AsEnumerable()
                        on dataRows1.Field<string>(table1KeyField) equals dataRows2.Field<string>(table2KeyField)
                    select new Result
                    {
                        Table1Row = dataRows1,
                        Table2Row = dataRows2,
                        DistictFieldValue = dataRows2[table2KeyField].ToString() // This could be anything else, even passed as an argument to the method
                    };
    
        // Dictinct on the results above
        var queryWithDistictResults = query.Distinct(new ResultComparer());
    
        // Write your logic to convert the Results Collection to a single data table with whatever columns you want
        DataTable result = queryWithDistictResults // <= YOUR LOGIC HERE
    
        return result;
    }
    
bit
  • 4,407
  • 1
  • 28
  • 50
  • bit, this seems very promising, but I'm puzzled at the return result, which seems to be returning 2 rows + some arbitrary value. I'm wanting to return all the rows from the join, and all columns from the 1st table + just the join column from the 2nd table. Does your solution do that? I want to try it out, but first if you could help me understand. Thanks. – HerrimanCoder Jan 23 '19 at 13:46
  • A dilemma I have with your approach is that I end up with 2 independent and unrelated `DataRow`s for each Result object. Is there an easy way to combine those into a single datarow and into a datatable? If so, I can then remove the columns I don't want and return the final datatable. – HerrimanCoder Jan 23 '19 at 18:29
  • They aren't really unrelated, they are the outcome of the JOIN. To join the 2 rows you could take a look at this: https://stackoverflow.com/questions/16945569/join-2-datarow-rows-into-one-row-c-sharp – bit Jan 24 '19 at 08:36
  • I need to avoid `foreach`es because I'm dealing with 1/2 million rows of data each time, and that really slows things down. When I got away from foreach and went to linq joins my performance improved remarkably - so I can't go back to loops. Isn't there some way to have 1 set of datarows and to define my desired columns inside the `new{}` ? – HerrimanCoder Jan 24 '19 at 13:01
0

If every CSV-file represents one table of your database, then Consider doing something similar as entity framework.

Instead of IQueryable<...> let your DbSets implement IEnumerable<...>

If you only need to fetch the data, this will be fairly easy. If you also want to update, you'll need to implement (or re-use) a DbChangeTracker

public DbSet<T> : IEnumerable<T> where T: class
{
    public FileInfo CsvFile {get; set;}

    public IEnumerator<T> GetEnumerator()
    {
        return this.ReadCsvFile().GetEnumerator();
    }
    IEnumerator IEnumerable.GetEnumerator()
    {
        return this.GetEnumerator();
    }

    protected IEnumerable<T> ReadCsvFile()
    {
        // open the CsvFile, read the lines and convert to objects of type T
        // consider using Nuget package CsvHelper
        ...
        foreach (var csvLine in csvLines)
        {
            T item = Create<T>(csvLine); // TODO: write how to convert a line into T
            yield return T;
        }
    }
}

You'll also need a DbContext that holds all your DbSets:

class DbContext
{
      public DbSet<School> Schools {get; } = new DbSet<School>{CsvFile = ...};
      public DbSet<Teacher> Teachers {get; } = new DbSet<Teacher> {CsvFile = ...};
      public DbSet<Student> Students {get; } = new DbSet<Student> {CsvFile = ...};
}

You can improve performance by remembering already fetched items. Put them in a Dicationary, use the primary key as dicationary key. Also add a Find function to the DbSet:

class DbSet<T> : IEnumerable<T>
{
    private readonly Dictionary<int, T> fetchedItems = new Dictionary<int, T>();

    public T Find(int id)
    {
        if (!fetchedItems.TryGetValue(id, out T fetchedItem))
        {
            // fetch elements using ReadCsvFile and put them in the Dictionary
            // until you found the item with the requested primary key
            // or until the end of your sequence
        }
        return fetchedItem;
    }
}

It's most easy if every table item has the same type of primary key:

interface IPrimaryKey
{
     int Id {get;}
}

class DbSet<T> : IEnumerable<T> where T : IPrimaryKey {...}

If not, you'll need to tell your DbSet the type of primary key:

class DbSet<T, TKey> : IEnumerable<T> where T : class
{
     private readonly Dictinary<TKey, T> fetchedItems = ...
}

If you decide to keep your items in a Dictionary, then let your GetEnumerator first return the already fetchedItems, before fetching new rows from your CSV-file.

Add / Update / Remove items

For this, you need to be able to Add / Update / Remove items from your CsVFile. I assume there are already functions for that.

To do the update efficiently you'll need something similar to DbContext.SaveChanges. Let every DbSet remember what items to Add / Remove / Update, using a ChangeTracker:

class Entity<T> where T : IPrimaryKey
{
    public T Value {get; set;}
    public T OriginalValue {get; set;}
}

class ChangeTracker<T, TKey> where T: ICloneable
{
    readonly Dictionary<int, Entity<T, TKey>> fetchedEntities = new Dictionary<int, Entity<T, TKey>>
    readonly List<T> itemsToAdd = new List<T>();

    public T Add(T item)
    {
        // TODO: check for not NULL, and Id == 0
        this.ItemsToAdd.Add(itemToAdd);
        return item;
    }
    public void Remove(T item)
    {
        // TODO: check not null, and primary key != 0
        Entity<T> entityToRemove = Find(item.Id);
        // TODO: decide what to do if there is no such item
        entityToRemove.Value = null;
        // null indicates it is about to be removed
    }

You'll need a Find that remembers the original value:

public Entity<T> Find(TKey primaryKey)
{
    // is it already in the Dictionary (found before)?
    // if not: get it from the CsvDatabase and put it in the dictionary
    if (!fetchedItems.TryGetValue(primaryKey, out Entity<T> fetchedEntity))
    {
        // not fetched yet, fetch if from your Csv File
        T fetchedItem = ...
        // what to do if does not exist?
        // add to the dictionary:
        fetchedEntities.Add(new Entity<T>
        {
            value = fetchedItem,
            originalValue = (T)fetchedItem.Clone(),
            // so if value changes, original does not change
        });
    }
    return fetchedItem;
}

Finally your SaveChanges()

void SaveChanges()
{
    // your CsvFile database has functions to add / update / remove items
    foreach (var itemToAdd in itemsToAdd)
    {
        csvDatabase.Add(itemToAdd);
    }

    // update or remove fetched items with OriginalValue unequal to Value
    var itemsToUpdate = this.fetchedItems
        .Where(fetchedItem => !ValueComparer.Equals(fetchedItem.OriginalValue, fetchedItem.Value)
        .ToList();

    foreach (Entity<T> itemToUpdate in itemsToUpdate)
    {
        if (itemToUpdate.Value == null)
        {   // remove
            csvFile.Remove(itemToUpdate.OriginalValue);
        }
        else
        {   // update
            csvFile.Update(...);
        } 
    }
}

Apparently if you want to be able to update items in your database you need to be able to check whether items have changed. You'll need an IEqualityComparer<T> that checks by value

class DbChangeTracker<T, TKey> : IEnumerable<T> where T : class
{
     public IEqualityComparer<T> ValueComparer {get; set;}
     ...
}

DbSet SaveChanges:

void SaveChanges()
{
    this.ChangeTracker.SaveChanges();
}

DbContext SaveChanges:

Students.SaveChanges()
Teachers.SaveChanges();
Schools.SaveChanges();
Harald Coppoolse
  • 28,834
  • 7
  • 67
  • 116
  • Reading the CSV data into a datatable is already fast and easy; not the problem I'm trying to solve. But thanks anyway for your thoughtful answer. I'm just looking for an elegant way to return columns known only at runtime resulting from a LINQ join of 2 datatables. – HerrimanCoder Jan 23 '19 at 17:52