
In the past, I have written queries that simply pull the entire result set, store the results in memory, and then feed the whole sequence to a .CSV type provider. An example query:

let results = 
    query {
        for row in db.ThisRow do 
        select row
    }
    |> Seq.toList

As I understand it, the Seq.toList part forces the query to run; without it, the query output would stay a lazy Seq. If the number of query results is small, this isn't a big deal. If, however, the number of results is large (e.g., >1 million rows), I end up with a System.OutOfMemoryException. As a result, a friend suggested I look into the Nessos.Streams library.
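
Just to illustrate the laziness with a toy snippet (not my real query): nothing is evaluated until something consumes the sequence.

// Toy example: a seq does no work at the point of definition.
let lazyNumbers = Seq.initInfinite (fun i -> i * 2)      // nothing runs yet
let firstFive = lazyNumbers |> Seq.take 5 |> Seq.toList  // Seq.toList forces evaluation here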

My goal is to pull a row from the query, do some operations on that row, then write that single row to a .CSV, and to do that iteratively for each row, with all rows ultimately in the same .CSV file.
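
Roughly this shape, in other words (just a sketch of the idea; processRow and formatAsCsvLine below are hypothetical stand-ins for the real per-row work, not functions I actually have):

open System.IO

// Hypothetical placeholders for the real per-row operations.
let processRow (itemNumber: string, count: int) = (itemNumber, count)
let formatAsCsvLine (itemNumber: string, count: int) = sprintf "%s,%d" itemNumber count

let writeQueryToCsv (rows: seq<string * int>) (path: string) =
    use writer = new StreamWriter(path)
    writer.WriteLine("item_number,num")                   // header
    for row in rows do                                    // only one row is materialized at a time
        let line = row |> processRow |> formatAsCsvLine   // operate on the row
        writer.WriteLine(line)                            // append it to the same file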

So, I try

open Nessos.Streams

type StringInt = {
    String: string option
    Int: int
    }

type StringIntCsvType = CsvProvider<Sample = "item_number, num", 
                                    Schema = "item_number (string option), num (int)", 
                                    HasHeaders = true>
let buildRowFromObject (obj: StringInt) = StringIntCsvType.Row(obj.String, 
                                                               obj.Int)

let results = 
    query {
        for row in db.ThisRow do 
        select row
    }
    |> Stream.ofSeq
    |> Stream.groupBy (fun row -> row.ITEMNUMBER)
    |> Stream.map (fun (itemString, seq) -> (itemString, (seq |> Seq.length)))
    |> Stream.map (fun (str, num) -> {String = Some str; 
                                      Int = num})
    |> Stream.map buildRowFromObject
    |> Stream.toSeq

let ThisCsv = new StringIntCsvType(results)
let ThisCsvLoc = "pathToFileLocation"
ThisCsv.Save(ThisCsvLoc)

Although this works inasmuch as I'm getting all of the rows I've queried from the database, it still isn't doing what I want it to do, and I'm not sure how to make that happen. Also, for larger queries, I still end up with a System.OutOfMemoryException.

I think I should be able to insert

|> Stream.map (fun x -> new StringIntCsvType(x))

under the |> Stream.map buildRowFromObject line, but the x errors out with

Type Constraint Mismatch.  The type
    CsvProvider<...>.Row
is not compatible with type
    Collections.Generic.IEnumerable<CsvProvider<...>.Row>

Even if I weren't given this error, I think adding that line would create a new .CSV for every row, and that is decidedly NOT what I want.

How can I write a query, operate on each piece of the query, and write each row of the query into the same .CSV file? Am I even close with the code above?

Steven
  • Instead of bringing the whole table into the client, why don't you use `group by` in your `query`? Remove the `groupBy |> map` calls and change your query to something like `query { for row in db.ThisRow do groupBy row.ItemNumber into g select (g.Key, g.Count()) }` – krontogiannis Oct 11 '16 at 17:07
  • Sure, I can do this, and it speeds up the query, but it doesn't help with the rest of the question. – Steven Oct 11 '16 at 17:21
  • the error occurs because something is expecting an array of Rows but you are passing one Row. Wrap it in an array or a seq. But generally you might need to rethink the approach and split the problem. If you can do some of the processing on the db side, that's best; that's what they are for. Then for outputting it to CSV you have many different ways: CsvProvider, FileHelpers, just plain strings. You should be able to process this lazily. Can we get an idea of the size involved? Is this like a terabyte database from where you are extracting a table that's like 100GB? Where do you get the errors? – s952163 Oct 11 '16 at 23:42

1 Answer


You don't necessarily need the intermediate StringInt record type. And you already have your sequence of CSV rows from buildRowFromObject; you can write that directly to a file. While I'm also a big fan of Nessos Streams, I'm not sure it adds that much value for you here (I might be wrong, and maybe you have some more complicated map/reduce operations in the pipeline). In this case, before exploring other solutions, I would go for the simple one first.

First, make sure your FSI is set to 64-bit and that you're also compiling the exe as 64-bit. Second, as @krontogiannis comments, your operation appears to be counting items in some group. I don't think you want to actually iterate through each group, so why not do the count on the DB side? You will get back a sequence of tuples (I hope), where you have some ID and the count. You can feed that directly to the CSV type provider. So:

type StringIntCsvType = CsvProvider<Sample = "item_number, num", 
                                    Schema = "item_number (string), num (int)", 
                                    HasHeaders = true>

let buildRowFromObject (row:string * int) = StringIntCsvType.Row(row)

let qry = query {
            for row in tbl1 do
            groupBy row.ItemNumber into g
            select (g.Key,g.Count())
            }

let csvout = qry |> Seq.map buildRowFromObject
(new StringIntCsvType(csvout)).Save(@"C:\tmp\test.csv")

This writes about 7 million rows in 9 secs into a ca. 100MB csv file.

In a more complex application you might factor out the file writing into something like this:

let writeFile csvout (path:string) =
    use csvtype = new StringIntCsvType(csvout)
    csvtype.Save(path)
writeFile csvout @"C:\tmp\test2.csv"

There is also a note in the documentation about working with a large number of rows. Not sure how relevant it is for you, but you can set CacheRows=false when instantiating the CsvProvider.

Edit

Some of this might not be relevant, but here it is:

#load @"..\..\FSLAB\packages\FsLab\FsLab.fsx"

#r "System.Data.dll"
#r "FSharp.Data.TypeProviders.dll"
#r "System.Data.Linq.dll"
#r @"..\packages\Streams.0.4.1\lib\net45\Streams.dll"
#r @"..\packages\FileHelpers.3.1.5\lib\net45\FileHelpers.dll"

open System
open System.Diagnostics
open System.IO
open System.Collections
open System.Collections.Generic
open System.IO.Compression

open System.Data
open System.Data.Linq
open System.Linq
open Microsoft.FSharp.Data.TypeProviders
open FSharp.Linq
open FSharp.Data
open Nessos.Streams
open FileHelpers

[<Literal>]
let connectionString2 = @"Data Source=(LocalDB)\MSSQLLocalDB;AttachDbFilename=C:\Users\...\Documents\test.mdf;Connection Timeout = 60"

type dbSchema = SqlDataConnection<connectionString2,StoredProcedures = true>


type StringIntCsvType = CsvProvider<Sample = "item_number, num", 
                                    Schema = "item_number (string), num (int)", 
                                    HasHeaders = true,CacheRows=false>



let getDbx() = 
    let dbx = dbSchema.GetDataContext()
    dbx.DataContext.ObjectTrackingEnabled <- false  // could impact memory consumption but so far has little effect on time
    dbx.DataContext.CommandTimeout <- 90
    dbx

let dbx = getDbx()

let tbl1 = dbx.MyTable
s952163
  • I was talking with my boss about `Nessos.Streams` yesterday and, given that `Seq` is already lazy, it seemed that `.Streams` use here was unnecessary. Also, streamlining the code to eliminate the record type and the grouping in the pipeline provides much more legible code. I'm running into a new error now. When I run the final line (which actually runs the query on the server), I get the error `System.InvalidOperationException: Could not format node 'New' for execution as SQL.`. – Steven Oct 12 '16 at 13:45
  • @Steven Well, seq is about the iterator, and if you chain a lot together, even when it's lazy it can bog you down. Can I assume this error occurs irrespective of the CSV part? I.e., were you to do `qry |> Seq.toList` to manifest it, you'd get the same error? Any more details about the setup would also be helpful: type of database, version, type provider. Above I'm using the stock SqlDataConnection type provider to access SQL Server; if you use something else, maybe it won't have groupBy, for example. – s952163 Oct 12 '16 at 14:01
  • 1
  • You're right: with the `groupBy` and `qry |> Seq.toList` the error occurs. The database is `SQL Server 11.0.2218` and the type provider I'm using is `FSharp.Data.TypeProviders.SqlDataConnection`. The `groupBy` *does* exist in that provider, as I've used that function for other queries. [This SO Q/A](http://stackoverflow.com/questions/5890160/could-not-format-node-value-for-execution-as-sql) provides some indication that I'm not alone. – Steven Oct 12 '16 at 14:11
  • @Steven So SQL Server 2012. I don't have access to one now. This error does not seem straightforward to fix, and I'm a bit puzzled by what could cause this message. Is your table complicated (some sort of view)? If you can test it on SQL Server 2014-16, that would be ideal. Otherwise you might need to run the groupBy on the client. I will add the rest of the stuff I'm referencing in a bit. – s952163 Oct 12 '16 at 14:25
  • 1
  • Interesting. I dropped the `Stream`s, kept everything as `Seq`, and streamlined the whole pipeline by eliminating my record type, and running the query/writing to `.CSV` for the largest of my queries (where I'm pulling ~1.3 million rows of ~10 million total rows) worked without issue. It took some time (`Real: 00:06:54.127`), but ran successfully. Now I'm wondering if having too many `Seq`s chained together was causing things to be bogged down on my end while I was also still pulling from the database? – Steven Oct 12 '16 at 14:30
  • @Steven you might have been iterating through the same stuff multiple times. I recall a Seq.length in there too. So I assume this is running on the client? A few million rows shouldn't be an issue but depends on memory of course. You might want to look at [PSeq](http://fsprojects.github.io/FSharp.Collections.ParallelSeq/) to speed up things. – s952163 Oct 12 '16 at 14:35
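
For reference, a minimal sketch of the PSeq suggestion in the last comment. This is an assumption-laden sketch: it presumes the FSharp.Collections.ParallelSeq package is referenced, reuses qry and buildRowFromObject from the answer above, and the output path is made up. Note that PSeq does not guarantee row order.

open FSharp.Collections.ParallelSeq

// Parallelize the client-side mapping step; the query itself still runs on the server.
let csvoutParallel = qry |> PSeq.map buildRowFromObject
(new StringIntCsvType(csvoutParallel)).Save(@"C:\tmp\test_parallel.csv")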