I'm writing a library to parse JSON data that looks like this:

{"x": [[1, "a"], [2, "b"]]}

i.e. I have a key with a list of lists, where the inner lists can contain different data types but every inner list has the same sequence of types. That sequence can differ between JSON schemas but is known ahead of time.

The desired output would look something like: vec![vec![1,2], vec!["a", "b"]] (with the data wrapped in some appropriate enum for the different dtypes).

I began implementing DeserializeSeed for Vec<DataTypes>; below is some similar pseudo-code.

enum DataTypes {
    I32,
    I64,
    String,
    F32,
    F64
}


fn visit_seq<S>(self, mut seq: S) -> Result<Self::Value, S::Error>
    where
        S: SeqAccess<'de>,
    {
        let types: Vec<DataTypes> = self.0.data;
        let out: Vec<Vec<...>>;
        while let Some(inner_seq: S) = seq.next_element::<S>()? { // <-- this is the line
           for (i, type) in types.enumerate() {
               match type {
                   DataTypes::I32 => out[i].push(inner_seq.next_element::<i32>()?),
                   DataTypes::I64 => out[i].push(inner_seq.next_element::<i64>()?),
                   ...
               }
           }
        }
    }

My problem is I can't seem to find a way to get SeqAccess for the inner lists and I don't want to deserialize them into something like Vec<serde_json::Value> because I don't want to have to allocate the additional vector.

wnorcbrown

1 Answer

Please fasten your seat belts, this is verbose.

I'm assuming you want to deserialize some JSON data

{"x": [[1, "a"], [2, "b"]]}

to some Rust struct

struct X {
    x: Vec<Vec<Value>>, // Value is some enum containing string/int/float…
}

all while

  • transposing the elements of the inner lists while inserting into the vectors
  • checking that the inner vector elements conform to some type passed to deserialization
  • not doing any transient allocations

At the start, you have to realize that you have three different types that you want to deserialize: X, Vec<Vec<Value>>, and Vec<Value>. (Value itself you don't need, because what you actually want to deserialize are strings and ints and whatnot, not Value itself.) So, you need three deserializers, and three visitors.

The innermost DeserializeSeed holds a mutable reference to a Vec<Vec<Value>>, and distributes the elements of a single [1, "a"], one to each Vec<Value>.

struct ExtendVecs<'a>(&'a mut Vec<Vec<Value>>, &'a [DataTypes]);
impl<'de, 'a> DeserializeSeed<'de> for ExtendVecs<'a> {
    type Value = ();
    fn deserialize<D>(self, deserializer: D) -> Result<Self::Value, D::Error>
    where
        D: Deserializer<'de>,
    {
        struct ExtendVecVisitor<'a>(&'a mut Vec<Vec<Value>>, &'a [DataTypes]);
        impl<'de, 'a> Visitor<'de> for ExtendVecVisitor<'a> {
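            type Value = ();
            // `expecting` is a required method of `Visitor`; serde uses it in error messages.
            fn expecting(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
                f.write_str("a sequence matching the expected element types")
            }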
            fn visit_seq<A>(self, mut seq: A) -> Result<(), A::Error>
            where
                A: SeqAccess<'de>,
            {
                for (i, typ) in self.1.iter().enumerate() {
                    match typ {
                        // too_short checks for None and turns it into Err("expected more elements")
                        DataTypes::Stri => self.0[i].push(Value::Stri(too_short(self.1, seq.next_element::<String>())?)),
                        DataTypes::Numb => self.0[i].push(Value::Numb(too_short(self.1, seq.next_element::<f64>())?)),
                    }
                }
                // TODO: check all elements consumed
                Ok(())
            }
        }
        deserializer.deserialize_seq(ExtendVecVisitor(self.0, self.1))
    }
}
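
The per-row logic above, including the two checks that are only hinted at in comments (a row that is too short, and a row with leftover elements), can be sketched without serde. In this sketch a plain iterator of already-parsed values stands in for `SeqAccess`; `extend_columns` and the `String` error type are hypothetical names for illustration only, and the `Value` and `DataTypes` definitions are assumptions inferred from how the answer uses them.

```rust
// Assumed definitions, inferred from how the answer uses them.
#[derive(Debug, Clone, PartialEq)]
enum Value {
    Stri(String),
    Numb(f64),
}

#[derive(Debug, Clone, Copy, PartialEq)]
enum DataTypes {
    Stri,
    Numb,
}

// Hypothetical helper: pushes one row into the per-column vectors.
// The iterator plays the role that `SeqAccess` plays in the real visitor.
fn extend_columns(
    columns: &mut [Vec<Value>],
    types: &[DataTypes],
    mut row: impl Iterator<Item = Value>,
) -> Result<(), String> {
    for (i, typ) in types.iter().enumerate() {
        // Same job as `too_short`: a missing element becomes an error.
        let value = row
            .next()
            .ok_or_else(|| format!("expected {} elements, got {}", types.len(), i))?;
        // Check that the element has the type the schema asks for.
        let type_ok = matches!(
            (typ, &value),
            (DataTypes::Stri, Value::Stri(_)) | (DataTypes::Numb, Value::Numb(_))
        );
        if !type_ok {
            return Err(format!("element {} does not match {:?}", i, typ));
        }
        columns[i].push(value);
    }
    // The "check all elements consumed" TODO: reject leftover elements.
    if row.next().is_some() {
        return Err(format!("expected only {} elements", types.len()));
    }
    Ok(())
}
```

Like the visitor above, this pushes elements as it goes, so a row that fails halfway leaves the columns partially extended; that is fine as long as the whole result is discarded on error.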

The middle DeserializeSeed constructs the Vec<Vec<Value>>, gives the innermost ExtendVecs access to it, and asks ExtendVecs to have a look at each inner list of the [[…], […]]:

struct TransposeVecs<'a>(&'a [DataTypes]);
impl<'de, 'a> DeserializeSeed<'de> for TransposeVecs<'a> {
    type Value = Vec<Vec<Value>>;
    fn deserialize<D>(self, deserializer: D) -> Result<Self::Value, D::Error>
    where
        D: Deserializer<'de>,
    {
        struct TransposeVecsVisitor<'a>(&'a [DataTypes]);
        impl<'de, 'a> Visitor<'de> for TransposeVecsVisitor<'a> {
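            type Value = Vec<Vec<Value>>;
            // `expecting` is a required method of `Visitor`; serde uses it in error messages.
            fn expecting(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
                f.write_str("a sequence of sequences")
            }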
            fn visit_seq<A>(self, mut seq: A) -> Result<Vec<Vec<Value>>, A::Error>
            where
                A: SeqAccess<'de>,
            {
                let mut vec = Vec::new();
                vec.resize_with(self.0.len(), || vec![]);
                while let Some(()) = seq.next_element_seed(ExtendVecs(&mut vec, self.0))? {}
                Ok(vec)
            }
        }

        deserializer.deserialize_seq(TransposeVecsVisitor(self.0))
    }
}

Finally, the outermost DeserializeSeed is nothing special anymore; it just hands access to the type array down:

struct XD<'a>(&'a [DataTypes]);
impl<'de, 'a> DeserializeSeed<'de> for XD<'a> {
    type Value = X;
    fn deserialize<D>(self, deserializer: D) -> Result<Self::Value, D::Error>
    where
        D: Deserializer<'de>,
    {
        struct XV<'a>(&'a [DataTypes]);
        impl<'de, 'a> Visitor<'de> for XV<'a> {
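            type Value = X;
            // `expecting` is a required method of `Visitor`; serde uses it in error messages.
            fn expecting(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
                f.write_str("a map with the key \"x\"")
            }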
            fn visit_map<A>(self, mut map: A) -> Result<Self::Value, A::Error>
            where
                A: serde::de::MapAccess<'de>,
            {
                // The key is read but not validated yet, hence the underscore prefix.
                let _k = map.next_key::<String>()?;
                // TODO: check that the key is "x"
                Ok(X { x: map.next_value_seed(TransposeVecs(self.0))? })
            }
        }

        deserializer.deserialize_struct("X", &["x"], XV(self.0))
    }
}

Now, you can seed the outermost DeserializeSeed with your desired type list and use it to deserialize one X, e.g.:

XD(&[DataTypes::Numb, DataTypes::Stri]).deserialize(
    &mut serde_json::Deserializer::from_str(r#"{"x": [[1, "a"], [2, "b"]]}"#)
)

Playground with all the left-out error handling


Side note: If you can (i.e. if the format you're deserializing is self-describing like JSON), I'd recommend doing the type checking after deserialization. Why? Because doing it during deserialization means that all deserializers up to the top deserializer must be DeserializeSeed, and you can't use #[derive(Deserialize)]. If you do the type checking after, you can #[derive(Deserialize)] and #[serde(deserialize_with = "TransposeVecs_deserialize_as_free_function")] x: Vec<Vec<Value>>, and save half of the cruft in this post.
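
The check-after-deserialization approach boils down to one pass over the finished columns. A minimal sketch, assuming the same two-variant `Value` and `DataTypes` enums the answer uses; `check_columns`, `data_type` and the `String` error type are hypothetical names, not part of serde:

```rust
// Assumed definitions, inferred from how the answer uses them.
#[derive(Debug, PartialEq)]
enum Value {
    Stri(String),
    Numb(f64),
}

#[derive(Debug, Clone, Copy, PartialEq)]
enum DataTypes {
    Stri,
    Numb,
}

impl Value {
    // Which `DataTypes` tag this value corresponds to.
    fn data_type(&self) -> DataTypes {
        match self {
            Value::Stri(_) => DataTypes::Stri,
            Value::Numb(_) => DataTypes::Numb,
        }
    }
}

// Hypothetical check, run once after deserialization has finished.
fn check_columns(columns: &[Vec<Value>], types: &[DataTypes]) -> Result<(), String> {
    if columns.len() != types.len() {
        return Err(format!(
            "expected {} columns, got {}",
            types.len(),
            columns.len()
        ));
    }
    for (i, (column, expected)) in columns.iter().zip(types).enumerate() {
        // Report the first element whose type differs from the schema.
        if let Some(row) = column.iter().position(|v| v.data_type() != *expected) {
            return Err(format!("column {}, row {}: expected {:?}", i, row, expected));
        }
    }
    Ok(())
}
```

The trade-off is that mistyped data is only rejected after everything has been read, rather than at the first offending element.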

Caesar
  • I hope this time nobody comes and goes "you went 720° on this, coulda just …" – Caesar Mar 03 '22 at 06:40
  • This solves my problem! Thank you so much for taking the time on that comprehensive answer. The output of these is actually going to be used as numerical arrays and so not allocating large vectors of the wrong precision is key – wnorcbrown Mar 03 '22 at 09:53
  • I should have said so right away: a Rust enum takes as much space as its largest element plus a discriminant and alignment. So the size of `Value` is actually 32 bytes on a 64-bit system (String has one pointer and two `usize` lengths. I assume the discriminant is 1 byte, and the other 7 are alignment). Instead of having an enum for each value, you should have an enum per inner vector: `X { x: Vec<Values> }` and `enum Values { Strings(Vec<String>), Numbers(Vec<f64>) }`. (That'll also allow you to drop the `&'a [DataTypes]` member of `ExtendVecs`.) Feel free to ask if you get stuck with that. – Caesar Mar 03 '22 at 11:00
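
The layout difference described in the last comment can be sketched as follows. The exact size of `Value` depends on the target and compiler version, so this sketch computes it with `size_of` instead of hard-coding a number; `number_storage_bytes` is a hypothetical helper, and the `Values` enum is an assumption based on the names used in the comment.

```rust
use std::mem::size_of;

// Per-element enum, as in the answer: every element is as large as the
// largest variant (the String), even when it only holds a number.
#[allow(dead_code)]
enum Value {
    Stri(String),
    Numb(f64),
}

// Per-column enum, as suggested in the comment: one tag per column,
// and numbers are stored as plain `f64`s.
#[allow(dead_code)]
#[derive(Debug, PartialEq)]
enum Values {
    Strings(Vec<String>),
    Numbers(Vec<f64>),
}

// Hypothetical helper: bytes of element storage needed for `n` numbers,
// as (per-element layout, per-column layout). Vec headers and spare
// capacity are ignored.
fn number_storage_bytes(n: usize) -> (usize, usize) {
    (n * size_of::<Value>(), n * size_of::<f64>())
}
```

With the per-column layout, `X` would hold a `Vec<Values>`, and a numeric column could be handed to numerical code as a `&[f64]` without converting each element.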