I have a heterogeneous container type that holds a timestamp plus one of several inner message types. I serialize these to JSON in batches and load each batch into a dataframe (the per-batch load is sketched after the JSON sample below).
#[derive(serde::Serialize)]
pub struct Container {
    pub timestamp: Option<Timestamp>,
    pub sender_uid: u64,
    #[serde(flatten)]
    pub msg: Option<Msg>,
}

#[derive(serde::Serialize)]
pub enum Msg {
    ComponentOneStatus(ComponentOneStatus),
    ComponentTwoStatus(ComponentTwoStatus),
    // ... several more variants
}

#[derive(serde::Serialize)]
pub struct ComponentOneStatus {
    fieldA: Vec<f32>,
    fieldB: SomeOtherStruct,
    // ... many more fields
}

#[derive(serde::Serialize)]
pub struct ComponentTwoStatus {
    fieldC: u64,
    fieldD: YetAnotherStruct,
    // ... many more fields
}
The JSON looks like:
[
  {
    "timestamp": {"seconds": 21121212, "nanos": 1212121},
    "ComponentOneStatus": {"fieldA": [0.3, -2.3, 3.3], "fieldB": {"nestedOther1": 4, ...}}
  },
  {
    "timestamp": {"seconds": 434334, "nanos": 1212},
    "ComponentTwoStatus": {"fieldC": 9, "fieldD": {"differentNestedProp": 4, ...}}
  }
]
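For context, the per-batch load is roughly the following (simplified sketch; my real code differs in details, and `batch_to_df` is just an illustrative name):

use std::io::Cursor;
use polars::prelude::*;

// Sketch: serialize one batch with serde_json, then let polars
// infer the schema from the JSON buffer.
fn batch_to_df(batch: &[Container]) -> PolarsResult<DataFrame> {
    let json = serde_json::to_vec(batch).expect("serialize batch");
    JsonReader::new(Cursor::new(json)).finish()
}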
I immediately run some analytics on that batch as a dataframe, then hold a reference to it, diagonally concatenate it with previous batches, and store the result in a parquet file:
use std::fs::File;
use std::time::{SystemTime, UNIX_EPOCH};

let mut combined = diag_concat_lf([df1.lazy(), df2.lazy(), df3.lazy()], true, true)?
    .collect()?;
let now = SystemTime::now().duration_since(UNIX_EPOCH)?.as_millis();
let buf = File::create(format!("data/parquet/{now}.parquet"))?;
ParquetWriter::new(buf).finish(&mut combined)?;
Post-diagonal concatenation, every record in the parquet file carries all of the fields for all of the nested message types. For instance, a record that contained a ComponentTwoStatus looks like:
{
  "timestamp": {...},
  "ComponentOneStatus": {"fieldA": NULL, "fieldB": NULL},
  "ComponentTwoStatus": {"fieldC": 9, "fieldD": {"differentNestedProp": 4, ...}}
}
Is it possible to lift the nulls so the record is stored as ComponentOneStatus: NULL instead? I'm imagining something like the expression sketch below, but I don't know if that's the right approach.
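A sketch only, assuming polars' when/then/otherwise keeps the struct dtype when one branch is null; `lift_nulls` and the field list are illustrative, not something I have working:

use polars::prelude::*;

// Sketch: collapse a struct column to a single null when every listed
// field is null. `fields` would have to be derived per message type.
fn lift_nulls(name: &str, fields: &[&str]) -> Expr {
    let all_null = fields
        .iter()
        .map(|f| col(name).struct_().field_by_name(f).is_null())
        .reduce(|a, b| a.and(b))
        .expect("at least one field");
    when(all_null)
        .then(lit(NULL))
        .otherwise(col(name))
        .alias(name)
}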
Additionally, I get a stack overflow if I try to diag_concat more than ~10 batches (1024 records each).
Should I instead concatenate the JSON buffers and re-deserialize everything in one pass (sketched below)? Or is there a way to skip the JSON step entirely?
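By "concatenate the buffers" I mean roughly this — a sketch assuming each batch is written as newline-delimited JSON so polars' NDJSON reader can infer the schema once over all batches (`batches_to_df` is an illustrative name):

use std::io::{Cursor, Write};
use polars::prelude::*;

// Sketch: accumulate every batch as NDJSON lines in one buffer, then
// deserialize a single DataFrame instead of diag-concatenating many.
fn batches_to_df(batches: &[Vec<Container>]) -> PolarsResult<DataFrame> {
    let mut buf = Vec::new();
    for batch in batches {
        for record in batch {
            serde_json::to_writer(&mut buf, record).expect("serialize");
            buf.write_all(b"\n").expect("newline");
        }
    }
    JsonLineReader::new(Cursor::new(buf)).finish()
}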
Note that the nested objects are complex protobuf messages, so I would prefer not to write a schema by hand; they are externally owned and are moving targets.
Thanks!