
I'm trying to read the same file using multiple goroutines, where each goroutine is assigned a byte offset to start reading from and a number of lines to read (linesLimit).

I was successful in doing so when the file fits in memory, by setting the reader's chunk size (the csv.WithChunk option) to the chunkSize variable. However, when the file is larger than memory, I need to use a smaller chunk size. I was attempting something like this:

package main

import (
    "io"
    "log"
    "os"
    "sync"

    "github.com/apache/arrow/go/v11/arrow"
    "github.com/apache/arrow/go/v11/arrow/csv"
)

// A reader to read lines from the file starting from the byteOffset. The number
// of lines is specified by linesLimit.
func produce(
    id int,
    ch chan<- arrow.Record,
    byteOffset int64,
    linesLimit int64,
    filename string,
    wg *sync.WaitGroup,
) {
    defer wg.Done()

    fd, _ := os.Open(filename)
    fd.Seek(byteOffset, io.SeekStart)

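    // Split the assigned lines into 10 equal chunks for the csv reader; the
    // remainder (fewer than 10 rows) is read separately by flush once `limit`
    // rows have been consumed in the loop below.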
    var remainder int64 = linesLimit % 10
    limit := linesLimit - remainder
    chunkSize := limit / 10

    reader := csv.NewInferringReader(fd,
        csv.WithChunk(int(chunkSize)),
        csv.WithNullReader(true, ""),
        csv.WithComma(','),
        csv.WithHeader(true),
        csv.WithColumnTypes(map[string]arrow.DataType{
            "Start_Time":        arrow.FixedWidthTypes.Timestamp_ns,
            "End_Time":          arrow.FixedWidthTypes.Timestamp_ns,
            "Weather_Timestamp": arrow.FixedWidthTypes.Timestamp_ns,
        }))
    reader.Retain()
    defer reader.Release()

    var count int64
    for reader.Next() {
        rec := reader.Record()
        rec.Retain() // released at the other end of the channel
        ch <- rec
        count += rec.NumRows()
        if count == limit {
            if remainder != 0 {
                flush(id, ch, fd, remainder)
            }
            break
        } else if count > limit {
            log.Panicf("Reader %d read more than it should, expected=%d, read=%d", id, linesLimit, count)
        }
    }

    if reader.Err() != nil {
        log.Panicf("error: %s in line %d,%d", reader.Err().Error(), count, id)
    }
}

func flush(id int,
    ch chan<- arrow.Record,
    fd *os.File,
    limit int64,
) {
    reader := csv.NewInferringReader(fd,
        csv.WithChunk(int(limit)),
        csv.WithNullReader(true, ""),
        csv.WithComma(','),
        csv.WithHeader(false),
    )

    reader.Retain()
    defer reader.Release()

    record := reader.Record()
    record.Retain() // nil pointer dereference error here
    ch <- record
}

I tried multiple versions of this previous code, including:

  1. Copying the file descriptor
  2. Copying the offset of the file descriptor, opening the same file and seeking to that offset.
  3. Closing the first reader before calling flush or closing the first fd.

The error seems to be the same no matter how I change the code. Note that any call on flush's reader raises an error, including reader.Next and reader.Err().

Am I using the csv readers wrong? Is this a problem with reusing the same file?

EDIT: I don't know if this helps, but opening a new fd in flush without any Seek avoids the error (somehow any Seek causes the original error to appear). However, the code is not correct without a Seek, i.e. part of the file is never read by any goroutine.

1 Answer


The main issue is that the csv reader uses a bufio.Reader underneath, which has a default buffer size of 4096 bytes. That means reader.Next() reads more bytes than it needs and caches the extra bytes. If you then read directly from the file after reader.Next(), you will miss the cached bytes.

The demo below shows this behavior:

package main

import (
    "bytes"
    "fmt"
    "io"
    "os"

    "github.com/apache/arrow/go/v11/arrow"
    "github.com/apache/arrow/go/v11/arrow/csv"
)

func main() {
    // Create a two-column csv file with this content (the second column has 1024 bytes):
    // 0,000000....
    // 1,111111....
    // 2,222222....
    // 3,333333....
    temp := createTempFile()

    schema := arrow.NewSchema(
        []arrow.Field{
            {Name: "i64", Type: arrow.PrimitiveTypes.Int64},
            {Name: "str", Type: arrow.BinaryTypes.String},
        },
        nil,
    )
    r := csv.NewReader(
        temp, schema,
        csv.WithComma(','),
        csv.WithChunk(3),
    )
    defer r.Release()

    r.Next()

    // Check what's left in the file after the first chunk (3 rows) is read.
    // If the reader stopped exactly at the end of the chunk, the remaining content would be:
    // 3,333333....
    // But in fact, because of the buffering, the content left is:
    // 33333333333
    buf, err := io.ReadAll(temp)
    if err != nil {
        panic(err)
    }

    fmt.Printf("%s\n", buf)
}

func createTempFile() *os.File {
    temp, err := os.CreateTemp("", "test*.csv")
    if err != nil {
        panic(err)
    }
    for i := 0; i < 4; i++ {
        fmt.Fprintf(temp, "%d,", i)
        if _, err := temp.Write(bytes.Repeat([]byte{byte('0' + i)}, 1024)); err != nil {
            panic(err)
        }
        if _, err := temp.Write([]byte("\n")); err != nil {
            panic(err)
        }
    }

    if _, err := temp.Seek(0, io.SeekStart); err != nil {
        panic(err)
    }

    return temp
}

It seems that the purpose of the second reader is to prevent the first one from reading into another goroutine's block of csv data. If you know the offset of the next block of csv data in advance, you can wrap the file in an io.SectionReader so that the reader only sees the current block. The current question does not provide enough information about this part; maybe we should leave it for another question.
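
For illustration, here is a minimal sketch of what produce could look like with an io.SectionReader. It is not a drop-in replacement: it assumes you already know nextOffset (a hypothetical parameter, the byte offset where the next block starts), that both offsets fall on line boundaries, and it reuses the imports from the code in the question.

// produceSection reads only the rows in [byteOffset, nextOffset) and sends
// them on ch. Wrapping the file in an io.SectionReader bounds what the csv
// reader can see, so its internal bufio.Reader can never read into the next
// goroutine's block.
func produceSection(
    id int,
    ch chan<- arrow.Record,
    byteOffset, nextOffset int64,
    filename string,
    wg *sync.WaitGroup,
) {
    defer wg.Done()

    f, err := os.Open(filename)
    if err != nil {
        log.Panicf("reader %d: open %s: %v", id, filename, err)
    }
    defer f.Close()

    // Expose only this goroutine's slice of the file to the csv reader.
    section := io.NewSectionReader(f, byteOffset, nextOffset-byteOffset)

    reader := csv.NewInferringReader(section,
        csv.WithChunk(1024), // any chunk size works; the section bounds the data
        csv.WithNullReader(true, ""),
        csv.WithComma(','),
        csv.WithHeader(byteOffset == 0), // assumption: only the first block starts with the header row
    )
    defer reader.Release()

    for reader.Next() {
        rec := reader.Record()
        rec.Retain() // released at the other end of the channel
        ch <- rec
    }
    if err := reader.Err(); err != nil {
        log.Panicf("reader %d: %v", id, err)
    }
}

Because the reader can no longer overrun its block, the manual row counting and the flush helper should not be needed with this approach.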

Notes:

  1. fd, _ := os.Open(filename): Never ignore errors. At least log them (a minimal example follows these notes).
  2. fd means file descriptor most of the time. Don't use it for a variable of type *os.File, especially since *os.File has a method named Fd.
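
For example, one minimal way to surface the failure inside produce (a sketch reusing id and filename from the question's code):

f, err := os.Open(filename)
if err != nil {
    log.Panicf("reader %d: failed to open %s: %v", id, filename, err)
}
defer f.Close()
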
Zeke Lu
  • Thank you for your answer, I'm aware of caching, but assumed it only caches the chunk and therefore each offset is at the start of a newline. I didn't know about `SectionReader`, it now works and the code is much simpler. Regarding the notes: 1. The file is opened multiple times before calling produce (to fetch byte offsets, limits and line limits), all errors are handled then. 3. In the 2nd to last line, I wrote a comment that the error is a nil dereference, the full error is just the call stack. – Mohamed Yasser May 02 '23 at 04:38
  • Thank you for feedback! I haven't noticed the comment about the nil dereference error. The doc for the `Next` func says `If a parse failure occurs, Next will return true and the Record will contain nulls where failures occurred. Subsequent calls to Next will return false - The user should check Err() after each call to Next to check if an error took place`. Have you checked what is returned from `reader.Err()`? – Zeke Lu May 02 '23 at 04:56
  • Yes, I read the docs before posting the question, and moved the Error check inside the for loop. The error was always null and the panic came from the `flush`'s reader. – Mohamed Yasser May 02 '23 at 05:13
  • Building on your explanation of caching, I believe the problem was that the offset I was trying to continue reading from was always in the middle of a line. Therefore, the csv reader couldn't read because it couldn't infer the schema and didn't accept the schema I provided it with normal csv readers. – Mohamed Yasser May 02 '23 at 05:19
  • Yes. That's correct. Regarding the error, sorry that I did not make it clear. The reader created in the `flush` func reads invalid csv data, and the method `Err()` of this reader should give us a meaningful message. The code `if reader.Err() != nil {` in the `produce` func can not get the error of the reader created in the `flush` func. – Zeke Lu May 02 '23 at 05:27
  • Trying to call `.Next()` or `.Err()` on `flush`'s reader raises the nil dereference error. (I am amazed that `.Record()` doesn't raise it.) That's why I couldn't debug it and assumed the error was in my usage of the file descriptor. – Mohamed Yasser May 02 '23 at 05:53
  • That's weird. Anyway, since the main issue has been addressed, let's ignore the error part. I'm going to remove note 3 from my answer. Thank you! – Zeke Lu May 02 '23 at 06:23