I'm trying to read the same file using multiple Goroutines, where each Goroutine is assigned a byte offset to start reading from and a number of lines to read (`linesLimit`).
I was successful in doing so when the file fits in memory, by setting the `csv.WithChunk` option to the full `chunkSize`. However, when the file is larger than memory, I need to reduce the chunk size. I was attempting something like this:
```go
package main

import (
	"io"
	"log"
	"os"
	"sync"

	"github.com/apache/arrow/go/v11/arrow"
	"github.com/apache/arrow/go/v11/arrow/csv"
)

// produce reads lines from the file starting at byteOffset. The number of
// lines to read is specified by linesLimit.
func produce(
	id int,
	ch chan<- arrow.Record,
	byteOffset int64,
	linesLimit int64,
	filename string,
	wg *sync.WaitGroup,
) {
	defer wg.Done()

	fd, _ := os.Open(filename)
	fd.Seek(byteOffset, io.SeekStart)

	var remainder int64 = linesLimit % 10
	limit := linesLimit - remainder
	chunkSize := limit / 10

	reader := csv.NewInferringReader(fd,
		csv.WithChunk(int(chunkSize)),
		csv.WithNullReader(true, ""),
		csv.WithComma(','),
		csv.WithHeader(true),
		csv.WithColumnTypes(map[string]arrow.DataType{
			"Start_Time":        arrow.FixedWidthTypes.Timestamp_ns,
			"End_Time":          arrow.FixedWidthTypes.Timestamp_ns,
			"Weather_Timestamp": arrow.FixedWidthTypes.Timestamp_ns,
		}))
	reader.Retain()
	defer reader.Release()

	var count int64
	for reader.Next() {
		rec := reader.Record()
		rec.Retain() // released at the other end of the channel
		ch <- rec
		count += rec.NumRows()
		if count == limit {
			if remainder != 0 {
				flush(id, ch, fd, remainder)
			}
			break
		} else if count > limit {
			log.Panicf("Reader %d read more than it should, expected=%d, read=%d", id, linesLimit, count)
		}
	}
	if reader.Err() != nil {
		log.Panicf("error: %s in line %d,%d", reader.Err().Error(), count, id)
	}
}

// flush reads the remaining lines (fewer than one chunk) from the same fd.
func flush(id int,
	ch chan<- arrow.Record,
	fd *os.File,
	limit int64,
) {
	reader := csv.NewInferringReader(fd,
		csv.WithChunk(int(limit)),
		csv.WithNullReader(true, ""),
		csv.WithComma(','),
		csv.WithHeader(false),
	)
	reader.Retain()
	defer reader.Release()

	record := reader.Record()
	record.Retain() // nil pointer dereference error here
	ch <- record
}
```
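For reference, the producers are driven by something like this (a rough sketch; the file name, offsets, and line counts below are hard-coded placeholders, the real values come from a pre-scan of the file for line boundaries):

```go
func main() {
	const filename = "data.csv" // placeholder path

	ch := make(chan arrow.Record, 8)
	var wg sync.WaitGroup

	// Placeholder partitioning: one byte offset and line count per Goroutine.
	offsets := []int64{0, 1_000_000}
	limits := []int64{50_000, 50_000}

	for i := range offsets {
		wg.Add(1)
		go produce(i, ch, offsets[i], limits[i], filename, &wg)
	}

	// Close the channel once every producer is done.
	go func() {
		wg.Wait()
		close(ch)
	}()

	// Consume the records; each one was Retained by its producer.
	for rec := range ch {
		_ = rec // ... process the record ...
		rec.Release()
	}
}
```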
I tried multiple versions of this code, including:

- Copying the file descriptor.
- Copying the offset of the file descriptor, opening the same file again, and seeking to that offset (sketched below).
- Closing the first reader before calling `flush`, or closing the first `fd`.
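For example, the reopen-and-seek variant (second bullet) replaced `flush` with roughly the following, where `byteOffset` is the first reader's current position obtained with `fd.Seek(0, io.SeekCurrent)` (the names here are just for illustration):

```go
func flushReopen(id int, ch chan<- arrow.Record, filename string, byteOffset int64, limit int64) {
	// Open the same file again and seek to where the first reader stopped.
	fd, err := os.Open(filename)
	if err != nil {
		log.Panicf("flush %d: %s", id, err)
	}
	defer fd.Close()
	if _, err := fd.Seek(byteOffset, io.SeekStart); err != nil {
		log.Panicf("flush %d: %s", id, err)
	}

	reader := csv.NewInferringReader(fd,
		csv.WithChunk(int(limit)),
		csv.WithNullReader(true, ""),
		csv.WithComma(','),
		csv.WithHeader(false),
	)
	defer reader.Release()

	record := reader.Record()
	record.Retain() // panics with the same nil pointer dereference
	ch <- record
}
```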
The error seems to be the same no matter how I change the code. Note that any call on `flush`'s reader raises an error, including `reader.Next()` and `reader.Err()`.
Am I using the csv readers wrong? Is this a problem with reusing the same file?
EDIT: I don't know if this helps, but opening a new `fd` in `flush` without any `Seek` avoids the error (somehow, any `Seek` makes the original error appear). However, the code is not correct without a `Seek`: removing it means that part of the file is never read by any Goroutine.
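Roughly, the variant that avoids the panic looks like this (a sketch; opening a fresh `fd` without seeking means the reader no longer starts where the first reader stopped, so the remainder lines end up unread):

```go
func flushNoSeek(id int, ch chan<- arrow.Record, filename string, limit int64) {
	// Fresh file handle, no Seek: this does not panic, but it reads from the
	// start of the file rather than from where the first reader stopped.
	fd, err := os.Open(filename)
	if err != nil {
		log.Panicf("flush %d: %s", id, err)
	}
	defer fd.Close()

	reader := csv.NewInferringReader(fd,
		csv.WithChunk(int(limit)),
		csv.WithNullReader(true, ""),
		csv.WithComma(','),
		csv.WithHeader(false),
	)
	defer reader.Release()

	if reader.Next() {
		rec := reader.Record()
		rec.Retain() // no nil pointer dereference here
		ch <- rec
	}
}
```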