
I have a heterogeneous container that holds a timestamp and exactly one of several inner message types. I serialize these containers to JSON in batches and load each batch into a dataframe.

#[derive(serde::Serialize)]
pub struct Container {
    pub timestamp: Option<Timestamp>,
    pub sender_uid: u64,
    #[serde(flatten)]
    pub msg: Option<Msg>,
}

#[derive(serde::Serialize)]
pub enum Msg {
    ComponentOneStatus(ComponentOneStatus),
    ComponentTwoStatus(ComponentTwoStatus),
    // ... several more options
}

#[derive(serde::Serialize)]
pub struct ComponentOneStatus {
    fieldA: Vec<f32>,
    fieldB: SomeOtherStruct,
    // ... many more fields
}

#[derive(serde::Serialize)]
pub struct ComponentTwoStatus {
    fieldC: u64,
    fieldD: YetAnotherStruct,
    // ... many more fields
}

The JSON looks like:

[
{
  timestamp: {seconds: 21121212, nanos: 1212121},
  ComponentOneStatus: {fieldA: [0.3, -2.3, 3.3], fieldB: {nestedOther1: 4 ...}}
},

{
  timestamp: {seconds: 434334, nanos: 1212},
  ComponentTwoStatus: {fieldC: 9, fieldD: {differentNestedProp: 4 ...}}
}
]

I immediately run some analytics on each batch as a dataframe. I then keep a reference to it, diagonally concatenate it with the previous batches, and store the result in a Parquet file:

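  use polars::prelude::*;
  use std::fs::File;
  use std::time::{SystemTime, UNIX_EPOCH};

  // Diagonal concat: union of all columns; columns a batch lacks are filled with nulls.
  // `combined` must be mutable because `finish` takes `&mut DataFrame`.
  let mut combined = diag_concat_lf([df1.lazy(), df2.lazy(), df3.lazy()], true, true)?
    .collect()?;

  // Name the output file after the current Unix time in milliseconds.
  let now = SystemTime::now().duration_since(UNIX_EPOCH)?.as_millis();
  let buf = File::create(format!("data/parquet/{now}.parquet"))?;
  ParquetWriter::new(buf).finish(&mut combined)?;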
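
To make the effect of the diagonal concatenation concrete, here is a toy, std-only model of the operation. It is not how Polars implements it: `Batch`, `height`, and `diag_concat` are hypothetical names, and columns are simplified to nullable integers instead of nested structs.

```rust
use std::collections::BTreeMap;

/// Toy column-oriented batch: column name -> values, all columns the same length.
pub type Batch = BTreeMap<String, Vec<Option<i64>>>;

/// Number of rows in a batch (0 for a batch with no columns).
fn height(batch: &Batch) -> usize {
    batch.values().next().map_or(0, Vec::len)
}

/// Diagonal concatenation: take the union of all column names and pad
/// columns that a batch lacks with nulls for that batch's rows.
pub fn diag_concat(batches: &[Batch]) -> Batch {
    // Start with one empty output column per distinct column name.
    let mut out: Batch = batches
        .iter()
        .flat_map(|b| b.keys().cloned())
        .map(|name| (name, Vec::new()))
        .collect();
    for batch in batches {
        let rows = height(batch);
        for (name, column) in out.iter_mut() {
            match batch.get(name) {
                Some(values) => column.extend(values.iter().copied()),
                None => column.extend(std::iter::repeat(None).take(rows)),
            }
        }
    }
    out
}
```

Concatenating a batch that only has `fieldA` with one that only has `fieldC` gives both columns in every row, with nulls wherever a batch had no such column.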

After the diagonal concatenation, every record in the Parquet file has all of the fields of all of the nested structs. For instance, a record that contained ComponentTwoStatus looks like this:

{
 timestamp: {...},  
 ComponentOneStatus: { fieldA: NULL, fieldB: NULL}, 
 ComponentTwoStatus: {fieldC: 9, fieldD: {differentNestedProp: 4 ...}} 
}

Is it possible to lift the nulls so that the record is stored as ComponentOneStatus: NULL?
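
To pin down what "lifting the nulls" means, here is a minimal std-only sketch of the intended semantics. It is not Polars code: `ComponentOneRow` and `lift_nulls` are hypothetical names, and `field_b` is simplified to a plain integer. A struct value whose fields are all null collapses to a single null, and anything else is left untouched.

```rust
/// Hypothetical, simplified stand-in for one struct cell after the diagonal concat.
#[derive(Debug, Clone, PartialEq)]
pub struct ComponentOneRow {
    pub field_a: Option<Vec<f32>>,
    pub field_b: Option<i64>, // simplified; the real field is a nested struct
}

/// Collapse a struct whose fields are all null into a single null.
/// A struct with at least one non-null field is returned unchanged.
pub fn lift_nulls(cell: Option<ComponentOneRow>) -> Option<ComponentOneRow> {
    cell.filter(|row| row.field_a.is_some() || row.field_b.is_some())
}
```

In dataframe terms this would mean replacing the struct column's value with null in every row where all of its fields are null.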

Additionally, I get a stack overflow if I try to diag_concat more than ~10 batches (1024 records each).
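
One possible workaround, assuming the overflow comes from deep recursion while the combined plan is built or collected (not confirmed here), is to run the concatenation on a thread with a larger stack. The sketch below uses only `std::thread::Builder`; `run_with_big_stack` is a hypothetical helper, and the closure passed to it stands in for the `diag_concat_lf(...)` and `collect()` calls.

```rust
use std::thread;

/// Run `work` on a freshly spawned thread with `stack_bytes` of stack
/// and return its result to the caller.
pub fn run_with_big_stack<T, F>(stack_bytes: usize, work: F) -> std::io::Result<T>
where
    F: FnOnce() -> T + Send + 'static,
    T: Send + 'static,
{
    let handle = thread::Builder::new()
        .stack_size(stack_bytes)
        .spawn(work)?;
    // A panic inside `work` surfaces as Err from `join`; re-raise it here.
    Ok(handle.join().unwrap_or_else(|e| std::panic::resume_unwind(e)))
}
```

For example, `run_with_big_stack(64 * 1024 * 1024, move || { /* concat and collect */ })` would give the work 64 MiB of stack; the size is an arbitrary choice. This only hides the symptom if the recursion depth grows with the number of batches.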

Should I instead concatenate the JSON buffers and re-deserialize? Is there a way to skip the JSON step?

Note that the nested objects are very complex protobuf messages, so I would prefer not to write a schema by hand, because they are externally defined moving targets.

Thanks!

Dennis Collective
