
I am trying to batch up rows from a large JSON doc and insert them into SQL Server.

The following code works, but it inserts only one row at a time into the SQL Server table. I think it is actually only every 1000th row that gets inserted.

add-type -path "C:\Program Files\WindowsPowerShell\Modules\newtonsoft.json\1.0.2.201\libs\newtonsoft.json.dll"
install-module sqlserver -AllowClobber

class EliteCoords {
    [double] $x
    [double] $y
    [double] $z
}

class EliteSystem {
    [int]         $id
    [long]        $id64
    [string]      $name
    [EliteCoords] $coords
    [string]      $date 
}

$dt = New-Object system.data.datatable

$dt.Columns.Add("ID",[int])
$dt.Columns.Add("ID64",[long])
$dt.Columns.Add("Name",[string])
$dt.Columns.Add("Coordsx",[decimal])
$dt.Columns.Add("Coordsy",[decimal])
$dt.Columns.Add("Coordsz",[decimal])
$dt.Columns.Add("DiscoveryDate",[string])

$stream = (Get-Item c:\code\systemsWithCoordinates.json).OpenText()
$reader = [Newtonsoft.Json.JsonTextReader]::new($stream)
while ($reader.Read()) {
    $cnt = 0
    $dt.Clear()
    while ($cnt -le 1000) {
        $cnt = $cnt + 1

        if ($reader.TokenType -eq 'StartObject') {

            $row = [Newtonsoft.Json.JsonSerializer]::CreateDefault().Deserialize($reader, [EliteSystem])

            $dr = $dt.NewRow()

            $dr["ID"]            = $row.id
            $dr["ID64"]          = $row.id64
            $dr["Name"]          = $row.name
            $dr["Coordsx"]       = $row.coords.x
            $dr["Coordsy"]       = $row.coords.y
            $dr["Coordsz"]       = $row.coords.z
            $dr["DiscoveryDate"] = $row.date

            $dt.Rows.Add($dr)       
        }
    }

    write-sqltabledata -ServerInstance ELITEDANGEROUS -Database EDSM -Schema Staging -Table SystemsWithCoordinates -InputData $dt
}

$stream.Close()

I know that the problem is that the if block is skipped for all but the first iteration of the inner while loop, because the token type changes from StartObject to StartArray.
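
(For illustration, a quick diagnostic like the following would dump the first few token types and their JSON paths, to show where StartArray and StartObject actually occur:)

$stream = (Get-Item c:\code\systemsWithCoordinates.json).OpenText()
$reader = [Newtonsoft.Json.JsonTextReader]::new($stream)
$i = 0
while ($reader.Read() -and $i -lt 20) {
    # For an array of objects this prints StartArray once, then StartObject,
    # PropertyName/value tokens and EndObject for each element.
    '{0,-16} {1}' -f $reader.TokenType, $reader.Path
    $i++
}
$reader.Close()
$stream.Close()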

I tried putting an additional reader loop inside, but of course that reads the whole file.

I also tried just reading the array rather than the object, but of course that fails because of the nested JSON.

How should I structure the loops, so that I can batch up and process 1000 rows?

asked by Pete Carter
  • Can't replicate. I attempted a basic example of this with `$dt` being a `[system.data.datatable]` type. Then I added columns to match yours. Then I looped through adding dummy data with a while loop. I ran the final `write-sqltabledata` at the end. All 1000 rows went at once and quickly. I tested with sqlserver module version 21.1.18121. – AdminOfThings Jun 22 '19 at 16:43
  • If you use SQL Server 2016 or above, you can simply insert the JSON itself into the database and work with it using T-SQL functions. https://learn.microsoft.com/en-us/sql/relational-databases/json/json-data-sql-server?view=sql-server-2017 – Repeat Daily Jun 22 '19 at 16:56
  • I can't insert the whole doc, for the same reason as I need to batch the rows when shredding. The files are huge. This file is 5GB, but I have another which is 90GB. It needs to be streamed and batched, or it will simply not be practical. – Pete Carter Jun 22 '19 at 21:33
  • @AdminOfThings - I have added some more detail to the question, including the schema of the destination table. It seems to be "ignoring" the nested while – Pete Carter Jun 22 '19 at 21:34
  • Can you verify that line `$row = [Newtonsoft.Json.JsonSerializer]::CreateDefault().Deserialize($reader, [EliteSystem])` produces a different output every time `$cnt` increases or is the goal to have 1000 copies of the same data? – AdminOfThings Jun 23 '19 at 12:12
  • Can you insert a debug line inside the if statement so that you can see how many times it actually enters that script block, e.g. just have it output $cnt to the console when you run it? – AdminOfThings Jun 23 '19 at 12:21

2 Answers


Your problem is that you are flushing and clearing the table for every call to $reader.Read() -- i.e., every row.

Instead, you need to accumulate rows until you reach 1000, and then flush:

$stream = (Get-Item c:\code\systemsWithCoordinates.json).OpenText()
$reader = [Newtonsoft.Json.JsonTextReader]::new($stream)
try {
    $serializer = [Newtonsoft.Json.JsonSerializer]::CreateDefault()
    while ($reader.Read()) {
        # If the reader is positioned at the start of an object then accumulate a row.
        if ($reader.TokenType -eq 'StartObject') {                
            $row = $serializer.Deserialize($reader, [EliteSystem])

            $dr = $dt.NewRow()

            $dr["ID"]            = $row.id
            $dr["ID64"]          = $row.id64
            $dr["Name"]          = $row.name
            $dr["Coordsx"]       = $row.coords.x
            $dr["Coordsy"]       = $row.coords.y
            $dr["Coordsz"]       = $row.coords.z
            $dr["DiscoveryDate"] = $row.date

            $dt.Rows.Add($dr)       
        }

        # If we have accumulated 1000 rows, flush them.
        if ($dt.Rows.Count -ge 1000) {
            write-sqltabledata -ServerInstance ELITEDANGEROUS -Database EDSM -Schema Staging -Table SystemsWithCoordinates -InputData $dt
            $dt.Clear()
        }
    }

    # Flush any remaining rows.
    if ($dt.Rows.Count -gt 0) {
        write-sqltabledata -ServerInstance ELITEDANGEROUS -Database EDSM -Schema Staging -Table SystemsWithCoordinates -InputData $dt
        $dt.Clear()
    }
}
finally {
    $reader.Close()
    $stream.Close()
}

Notes:

  • You should probably dispose of the StreamReader and JsonTextReader in a finally block in the event of an exception. For a pretty way to do this, see How to implement using statement in powershell? (a minimal sketch of that pattern follows these notes).

  • Allocating the serializer only once should improve performance a little, at no cost.

  • Without an example of the JSON file we can't say whether there are additional problems with the EliteSystem data model used for each row. For instance, if the JSON file is actually a jagged 2d array this might not work.
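
A minimal sketch of that disposal pattern, in the spirit of the linked question (the helper name Use-Object is illustrative, not a built-in cmdlet):

function Use-Object {
    param(
        [Parameter(Mandatory)] $InputObject,              # anything that implements IDisposable
        [Parameter(Mandatory)] [scriptblock] $ScriptBlock
    )
    try {
        & $ScriptBlock $InputObject
    }
    finally {
        # Cast to IDisposable so explicitly implemented Dispose() methods
        # (such as JsonReader's) can still be invoked.
        if ($null -ne $InputObject) { ([IDisposable]$InputObject).Dispose() }
    }
}

# Both readers get disposed even if deserialization throws.
Use-Object ((Get-Item c:\code\systemsWithCoordinates.json).OpenText()) {
    param($stream)
    Use-Object ([Newtonsoft.Json.JsonTextReader]::new($stream)) {
        param($reader)
        while ($reader.Read()) {
            # ... accumulate and flush rows exactly as in the code above ...
        }
    }
}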

answered by dbc

The answer was to replace the inner counting loop with a break, and to add an outer loop that continues until the end-of-stream marker is hit.

add-type -path "C:\Program Files\WindowsPowerShell\Modules\newtonsoft.json\1.0.2.201\libs\newtonsoft.json.dll"
#install-module sqlserver -AllowClobber

class EliteCoords {
    [double] $x
    [double] $y
    [double] $z
}

class EliteSystem {
    [int]         $id
    [long]        $id64
    [string]      $name
    [EliteCoords] $coords
    [string]      $date 
}

$dt = New-Object system.data.datatable

$dt.Columns.Add("ID",[int])
$dt.Columns.Add("ID64",[long])
$dt.Columns.Add("Name",[string])
$dt.Columns.Add("Coordsx",[decimal])
$dt.Columns.Add("Coordsy",[decimal])
$dt.Columns.Add("Coordsz",[decimal])
$dt.Columns.Add("DiscoveryDate",[string])

$stream = (Get-Item c:\code\systemsWithCoordinates.json).OpenText()
$reader = [Newtonsoft.Json.JsonTextReader]::new($stream)
while ($stream.EndOfStream -eq $false) {
    $cnt = 0
    $dt.Clear()
    while ($reader.Read()) {

        if ($reader.TokenType -eq 'StartObject') {

            $row = [Newtonsoft.Json.JsonSerializer]::CreateDefault().Deserialize($reader, [EliteSystem])

            $dr = $dt.NewRow()

            $dr["ID"]            = $row.id
            $dr["ID64"]          = $row.id64
            $dr["Name"]          = $row.name
            $dr["Coordsx"]       = $row.coords.x
            $dr["Coordsy"]       = $row.coords.y
            $dr["Coordsz"]       = $row.coords.z
            $dr["DiscoveryDate"] = $row.date

            $dt.Rows.Add($dr)

            $cnt = $cnt + 1
        }
        if ($cnt -gt 9999) {break}
    }
    write-sqltabledata -ServerInstance ELITEDANGEROUS -Database EDSM -Schema Staging -Table SystemsWithCoordinates -InputData $dt -Timeout 0
}

$stream.Close()
answered by Pete Carter
  • `JsonTextReader` buffers its input, so it's possible that `stream.EndOfStream` could be true before the last row has actually been deserialized. – dbc Jun 24 '19 at 07:41
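
To illustrate the buffering dbc describes, a short sketch (assuming a small sample file; sample.json is a hypothetical path): the JsonTextReader reads ahead from the StreamReader in chunks, so for a file that fits in one chunk the stream reports EndOfStream while unread tokens are still buffered in the reader.

$stream = (Get-Item c:\code\sample.json).OpenText()
$reader = [Newtonsoft.Json.JsonTextReader]::new($stream)

$null = $reader.Read()                                   # consume only the first token
"StreamReader.EndOfStream : $($stream.EndOfStream)"      # can already report True
"Reader still has tokens  : $($reader.Read())"           # yet more tokens remain buffered

$reader.Close()
$stream.Close()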