1

I'm trying to do a "database side" bulk copy (i.e. SELECT INTO/INSERT INTO) using linq2db. However, my code is trying to bring the dataset over the wire which is not possible given the size of the DB in question.

My code looks like this:

using (var db = new MyDb()) {
    var list = db.SourceTable.
        Where(s => s.Year > 2012).
        GroupBy(s => new { s.Column1, s.Column2 }).
        Select(g => new DestinationTable {
            Property1 = 'Constant Value',
            Property2 = g.First().Column1,
            Property3 = g.First().Column2,
            Property4 = g.Count(s => s.Column3 == 'Y')
        });

    db.Execute("TRUNCATE TABLE DESTINATION_TABLE");
    db.BulkCopy(new BulkCopyOptions {
      BulkCopyType = BulkCopyType.MultipleRows
    }, list);
}

The generated SQL looks like this:

BeforeExecute
-- DBNAME SqlServer.2017

TRUNCATE TABLE DESTINATION_TABLE
 DataConnection
Query Execution Time (AfterExecute): 00:00:00.0361209. Records Affected: -1.
 DataConnection
BeforeExecute
-- DBNAME SqlServer.2017
DECLARE @take Int -- Int32
SET     @take = 1
DECLARE @take_1 Int -- Int32
SET     @take_1 = 1
DECLARE @take_2 Int -- Int32
...
SELECT
        (
                SELECT TOP (@take)
                        [p].[YEAR]
                FROM
                        [dbo].[SOURCE_TABLE] [p]
                WHERE
                        (([p_16].[YEAR] = [p].[YEAR] OR [p_16].[YEAR] IS NULL AND [p].[YEAR] IS NULL) AND ...
...)
FROM SOURCE_TABLE p_16
WHERE p_16.YEAR > 2012
GROUP BY
  ...
DataConnection

That is all that is logged as the bulkcopy fails with a timeout, i.e. SqlException "Execution Timeout Expired".

Please note that running this query as an INSERT INTO statement takes less than 1 second directly in the DB.

PS: Anyone have any recommendations as to good code based ETL tools to do large DB (+ 1 TB) ETL. Given the DB size I need things to run in the database and not bring data over the wire. I've tried pyspark, python bonobo, c# etlbox and they all move too much data around. I thought linq2db had potential, i.e. basically just act like a C# to SQL transpiler but it is also trying to move data around.

Svyatoslav Danyliv
  • 21,911
  • 3
  • 16
  • 32
gatapia
  • 3,574
  • 4
  • 40
  • 48

2 Answers2

2

I would suggest to rewrite your query because group by can not return first element. Also Truncate is a part of the library.

var sourceQuery = 
   from s in db.SourceTable 
   where s.Year > 2012
   select new 
   {
      Source = s,
      Count = Sql.Ext.Count(s.Column3 == 'Y' ? 1 : null).Over()
        .PartitionBy(s.Column1, s.Column2).ToValue()
      RN = Sql.Ext.RowNumber().Over()
        .PartitionBy(s.Column1, s.Column2).OrderByDesc(s.Year).ToValue()
   };

db.DestinationTable.Truncate();

sourceQuery.Where(s => s.RN == 1)
  .Insert(db.DestinationTable, 
    e => new DestinationTable 
    {
       Property1 = 'Constant Value',
       Property2 = e.Source.Column1,
       Property3 = e.Source.Column2,
       Property4 = e.Count
    });
Svyatoslav Danyliv
  • 21,911
  • 3
  • 16
  • 32
1

After some investigation I stumbled onto this issue. Which lead me to the solution. The code above needs to change to:

db.Execute("TRUNCATE TABLE DESTINATION_TABLE");
db.SourceTable.
  Where(s => s.Year > 2012).
  GroupBy(s => new { s.Column1, s.Column2 }).
  Select(g => new DestinationTable {
    Property1 = 'Constant Value',
    Property2 = g.First().Column1,
    Property3 = g.First().Column2,
    Property4 = g.Count(s => s.Column3 == 'Y')
  }).Insert(db.DestinationTable, e => e);

Documentation of the linq2db project leaves a bit to be desired however, in terms of functionality its looking like a great project for ETLs (without horrible 1000s of line copy/paste sql/ssis scripts).

Svyatoslav Danyliv
  • 21,911
  • 3
  • 16
  • 32
gatapia
  • 3,574
  • 4
  • 40
  • 48