I'm trying to split some events that I'm collecting from Kafka based on a time interval. Basically, the goal is to read the value in the datetime column and run a simple formula to check whether the current event falls in the current interval. If it does, append the event to the RecordBuilder; otherwise, flush the group of events (a segment) to a parquet file.
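A minimal sketch of the "simple formula", assuming the datetime value is a non-negative Unix epoch in seconds and the interval is a fixed number of seconds. The names intervalStart and sameInterval are hypothetical and not part of Arrow:

```go
package main

// intervalStart returns the start (in Unix seconds) of the fixed-size
// interval that contains ts. This sketch assumes ts >= 0 and intervalSec > 0.
func intervalStart(ts, intervalSec int64) int64 {
	return ts - ts%intervalSec
}

// sameInterval reports whether two timestamps fall in the same interval.
func sameInterval(a, b, intervalSec int64) bool {
	return intervalStart(a, intervalSec) == intervalStart(b, intervalSec)
}
```

With a 5-minute interval, intervalStart(ts, 5*60) gives the key to compare against the key of the segment currently being filled.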
Here is the code I have so far:
type Segment struct {
    mu             sync.Mutex
    schema         *arrow.Schema
    evtStruct      *arrow.StructType
    builder        *array.RecordBuilder
    writer         *pqarrow.FileWriter
    timestampIndex int
}
func NewSegment(dir string, datetimeFieldName string, schema *arrow.Schema) (*Segment, error) {
    // other inits here
    // ...

    // create a parquet file
    pFile, err := os.Create(fileName)
    if err != nil {
        return nil, err
    }
    w, err := pqarrow.NewFileWriter(schema, pFile, props, pqarrow.DefaultWriterProps())
    if err != nil {
        // return the error instead of panicking, and don't leak the file handle
        pFile.Close()
        return nil, err
    }

    // create the new record builder for inserting data to arrow
    mem := memory.NewCheckedAllocator(memory.NewGoAllocator())
    b := array.NewRecordBuilder(mem, schema)

    // look up the position of the datetime column in the schema
    evtStruct := arrow.StructOf(schema.Fields()...)
    idx, ok := evtStruct.FieldIdx(datetimeFieldName)
    if !ok {
        return nil, fmt.Errorf("couldn't find datetime column %q", datetimeFieldName)
    }

    return &Segment{
        schema:         schema,
        evtStruct:      evtStruct,
        mu:             sync.Mutex{},
        builder:        b,
        writer:         w,
        timestampIndex: idx,
    }, nil
}
// data comes from Kafka and represents a single event
func (s *Segment) InsertData(data []byte) error {
    s.mu.Lock()
    defer s.mu.Unlock()

    // TODO: do partition by interval here
    // extract the datetime value --> dtVal (unix epoch, in seconds)
    // assume for now an interval of 5 minutes
    // dtPartition = dtVal - dtVal%(5*60)
    /* if dtPartition > oldDtPartition {
        return s.Flush()
    }
    */

    // append the data to the current builder
    return s.builder.UnmarshalJSON(data)
}
// Flush persists the segment to disk
func (s *Segment) Flush() error {
    s.mu.Lock()
    defer s.mu.Unlock()

    rec := s.builder.NewRecord()

    // release/close everything once the record has been written
    defer s.builder.Release()
    defer s.writer.Close()
    defer rec.Release()

    // write parquet file
    return s.writer.WriteBuffered(rec)
}
The problem is that I'm not able to "Unmarshal" the data input parameter of the InsertData function because there is no "struct" that it can be unmarshaled into. I can only create an arrow.Schema and an arrow.StructType, because the service I'm making allows a user to define the schema of the event. Hence I'm trying to find a way to read the datetime value in the event and decide which interval it falls in.
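One possible way to read a single field without a predefined struct, using only the standard library rather than Arrow, is to decode the event into a map of raw JSON values and unmarshal just the datetime field. This is a sketch under assumptions: the event is a flat JSON object, the field holds Unix epoch seconds as a JSON integer, and extractEpoch is a hypothetical helper name:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// extractEpoch pulls one top-level field out of a JSON object without
// decoding into a predefined struct. Only the named field is decoded;
// the other fields stay as raw bytes. It assumes the field holds Unix
// epoch seconds encoded as a JSON integer.
func extractEpoch(data []byte, field string) (int64, error) {
	var obj map[string]json.RawMessage
	if err := json.Unmarshal(data, &obj); err != nil {
		return 0, fmt.Errorf("decode event: %w", err)
	}
	raw, ok := obj[field]
	if !ok {
		return 0, fmt.Errorf("field %q not found in event", field)
	}
	var ts int64
	if err := json.Unmarshal(raw, &ts); err != nil {
		return 0, fmt.Errorf("field %q is not an integer epoch: %w", field, err)
	}
	return ts, nil
}
```

The returned value could then be used to pick the interval before the same bytes are handed to the builder. The cost is that each event gets parsed twice.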
In the function InsertData I added some silly pseudocode of what I'd like to achieve. Perhaps Apache Arrow has some functions that can help with what I'm trying to do. Thank you in advance.
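For reference, here is a stdlib-only sketch of the append-or-flush flow from that pseudocode. It is a model, not the Arrow code: segmentBuffer is a hypothetical stand-in where pending plays the role of the RecordBuilder and flushed the role of the written parquet files. Two points it tries to make explicit: sync.Mutex is not reentrant, so calling a locking Flush while InsertData already holds the lock would block forever, and returning right after the flush would drop the event that triggered it. Late events (older than the current interval) are simply appended to the current group here, which is an assumption.

```go
package main

import "sync"

// segmentBuffer is a hypothetical, stdlib-only model of the Segment.
type segmentBuffer struct {
	mu          sync.Mutex
	intervalSec int64
	current     int64 // start of the interval currently being filled
	hasCurrent  bool
	pending     [][]byte   // stands in for the RecordBuilder
	flushed     [][][]byte // stands in for the written parquet files
}

// Insert appends an event, flushing the pending group first when the
// event belongs to a later interval than the one being filled.
func (s *segmentBuffer) Insert(ts int64, data []byte) {
	s.mu.Lock()
	defer s.mu.Unlock()

	start := ts - ts%s.intervalSec
	if s.hasCurrent && start > s.current {
		s.flushLocked() // the lock is already held, so use the non-locking helper
	}
	if !s.hasCurrent || start > s.current {
		s.current, s.hasCurrent = start, true
	}
	// The triggering event is kept: it becomes the first event of the new group.
	s.pending = append(s.pending, data)
}

// Flush persists whatever is pending.
func (s *segmentBuffer) Flush() {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.flushLocked()
}

// flushLocked must be called with s.mu held.
func (s *segmentBuffer) flushLocked() {
	if len(s.pending) == 0 {
		return
	}
	s.flushed = append(s.flushed, s.pending)
	s.pending = nil
}
```

In the real Segment, flushLocked would be the place to write the record and then open a new writer and builder for the next interval, since the current Flush closes both.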