
I'm trying to split events that I'm collecting from Kafka based on a time interval. Basically, the goal is to read the value of the datetime column and run a simple formula to check whether the current event falls in the current interval. If it does, append the event to the RecordBuilder; otherwise, flush the group of events (a segment) to a Parquet file.

Here is the code I have so far:

type Segment struct {
    mu             sync.Mutex
    schema         *arrow.Schema
    evtStruct      *arrow.StructType
    builder        *array.RecordBuilder
    writer         *pqarrow.FileWriter
    timestampIndex int
}

func NewSegment(dir string, datetimeFieldName string, schema *arrow.Schema) (*Segment, error) {
    // other inits here
    // ...

    // create a parquet file
    pFile, err := os.Create(fileName)
    if err != nil {
        return nil, err
    }
    w, err := pqarrow.NewFileWriter(schema, pFile, props, pqarrow.DefaultWriterProps())
    if err != nil {
        return nil, err
    }

    // create the new record builder for inserting data to arrow
    mem := memory.NewCheckedAllocator(memory.NewGoAllocator())
    b := array.NewRecordBuilder(mem, schema)

    evtStruct := arrow.StructOf(schema.Fields()...)
    idx, ok := evtStruct.FieldIdx(datetimeFieldName)
    if !ok {
        return nil, fmt.Errorf("couldn't find datetime column")
    }

    return &Segment{
        schema:         schema,
        evtStruct:      evtStruct,
        mu:             sync.Mutex{},
        builder:        b,
        writer:         w,
        timestampIndex: idx,
    }, nil
}

// data comes from Kafka and represents a single event
func (s *Segment) InsertData(data []byte) error {
    s.mu.Lock()
    defer s.mu.Unlock()

    // TODO: do partition by interval here
    // extract the datetime value --> dtVal (unix epoch)
    // assume for now an interval of 5 minutes
    // dtPartition = math.Floor(dtVal / (5*60))
    /*
        if dtPartition > oldDtPartition {
            return s.Flush()
        }
    */

    // append the data to the current builder
    if err := s.builder.UnmarshalJSON(data); err != nil {
        return err
    }

    return nil
}

// Flush persists the segment to disk
func (s *Segment) Flush() error {
    s.mu.Lock()
    defer s.mu.Unlock()

    rec := s.builder.NewRecord()

    // closable
    defer s.builder.Release()
    defer s.writer.Close()
    defer rec.Release()

    // write parquet file
    if err := s.writer.WriteBuffered(rec); err != nil {
        return err
    }

    return nil
}

The problem is that I'm not able to "Unmarshal" the data input parameter of the InsertData function because there is no "struct" it can be unmarshaled to. I'm able to create an arrow.Schema and an arrow.StructType because the service I'm building allows a user to define the schema of the event. Hence I'm trying to find a way to read the datetime value in the event and decide which interval it falls in.

In the InsertData function I added some silly pseudocode of what I'd like to achieve. Perhaps Apache Arrow has some functions that can help with what I'm trying to do. Thank you in advance.

spaghettifunk

1 Answer


If you can do this: s.builder.UnmarshalJSON(data), then data is a JSON value. You can print data with fmt.Printf("%s", data) to confirm that.

If you're sure that every event contains a datetime column, then you can define a struct like this so that you can unmarshal the data to it:

type event struct {
    Datetime int64 `json:"datetime"`
}

Here is a small demo:

package main

import (
    "encoding/json"
    "fmt"
)

func getDatetime(data []byte) (int64, error) {
    var e struct {
        Datetime int64 `json:"datetime"`
    }
    if err := json.Unmarshal(data, &e); err != nil {
        return 0, err
    }

    return e.Datetime, nil
}

func main() {
    data := []byte(`{"datetime": 1685861011, "region": "NY", "model": "3", "sales": 742.0, "extra": 1234}`)

    fmt.Println(getDatetime(data))
}
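
Then, to plug this back into your InsertData, the partitioning could look roughly like the sketch below. Note that lastPartition would be a new int64 field on Segment, and flushLocked stands for a variant of Flush that doesn't take the mutex (InsertData already holds it, so calling Flush directly would deadlock); both are assumptions, not part of your current type:

const partitionInterval = int64(5 * 60) // 5 minutes, in seconds

func (s *Segment) InsertData(data []byte) error {
    s.mu.Lock()
    defer s.mu.Unlock()

    // partial unmarshal: read only the datetime field (unix epoch seconds)
    dtVal, err := getDatetime(data)
    if err != nil {
        return err
    }

    // integer division maps the timestamp to its 5-minute bucket
    dtPartition := dtVal / partitionInterval

    // the event belongs to a new bucket: flush the previous segment first
    if s.lastPartition != 0 && dtPartition > s.lastPartition {
        if err := s.flushLocked(); err != nil {
            return err
        }
    }
    s.lastPartition = dtPartition

    // append the event to the current builder
    return s.builder.UnmarshalJSON(data)
}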
Zeke Lu
    ah totally forgot that I can partially `Unmarshal` if I know the name of one of the JSON fields! So simple! Thank you very much. – spaghettifunk Jun 04 '23 at 07:50