I'm trying to do some stateful JSON parsing with serde and serde_json. I started by checking out How to pass options to Rust's serde that can be accessed in Deserialize::deserialize()?, and while I nearly have what I need, I seem to be missing something crucial.

What I'm trying to do is two-fold:

  1. My JSON is extremely large -- far too big to just read the input into memory -- so I need to stream it. (FWIW, it also has many nested layers, so I need to use disable_recursion_limit.)
  2. I need some stateful processing, where I can pass some data to the deserializer that will affect what data is kept from the input JSON and how it might be transformed during deserialization.

For example, my input might look like:

{ "documents": [
    { "foo": 1 },
    { "baz": true },
    { "bar": null }
    ],
    "journal": { "timestamp": "2023-04-04T08:28:00" }
}

Here, each object within the 'documents' array is very large, and I only need a subset of them. Unfortunately, I need to first find the key-value pair "documents", then I need to visit each element in that array. For now, I don't care about other key-value pairs (such as "journal"), but that might change.

My current approach is as follows:

use serde::de::DeserializeSeed;
use serde_json::Value;

/// A simplified state passed to and returned from the deserialization.
#[derive(Debug, Default)]
struct Stats {
    records_skipped: usize,
}

/// Models the input data; `Documents` is just a vector of JSON values,
/// but it is its own type to allow custom deserialization
#[derive(Debug)]
struct MyData {
    documents: Vec<Value>,
    journal: Value,
}

struct MyDataDeserializer<'a> {
    state: &'a mut Stats,
}

/// Top-level seeded deserializer only so I can plumb the state through
impl<'de> DeserializeSeed<'de> for MyDataDeserializer<'_> {
    type Value = MyData;

    fn deserialize<D>(mut self, deserializer: D) -> Result<Self::Value, D::Error>
    where
        D: serde::Deserializer<'de>,
    {
        let visitor = MyDataVisitor(&mut self.state);
        let docs = deserializer.deserialize_map(visitor)?;
        Ok(docs)
    }
}

struct MyDataVisitor<'a>(&'a mut Stats);

impl<'de> serde::de::Visitor<'de> for MyDataVisitor<'_> {
    type Value = MyData;

    fn expecting(&self, formatter: &mut std::fmt::Formatter) -> std::fmt::Result {
        write!(formatter, "a map")
    }

    fn visit_map<A>(self, mut map: A) -> Result<Self::Value, A::Error>
    where
        A: serde::de::MapAccess<'de>,
    {
        let mut documents = Vec::new();
        let mut journal = Value::Null;

        while let Some(key) = map.next_key::<String>()? {
            println!("Got key = {key}");
            match &key[..] {
                "documents" => {
                    // Not sure how to handle the next value in a streaming manner
                    documents = map.next_value()?;
                }

                "journal" => journal = map.next_value()?,
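                // Report unknown keys as a deserialization error instead of panicking
                other => return Err(serde::de::Error::unknown_field(other, &["documents", "journal"])),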
            }
        }

        Ok(MyData { documents, journal })
    }
}

struct DocumentDeserializer<'a> {
    state: &'a mut Stats,
}

impl<'de> DeserializeSeed<'de> for DocumentDeserializer<'_> {
    type Value = Vec<Value>;

    fn deserialize<D>(mut self, deserializer: D) -> Result<Self::Value, D::Error>
    where
        D: serde::Deserializer<'de>,
    {
        let visitor = DocumentVisitor(&mut self.state);
        let documents = deserializer.deserialize_seq(visitor)?;
        Ok(documents)
    }
}

struct DocumentVisitor<'a>(&'a mut Stats);

impl<'de> serde::de::Visitor<'de> for DocumentVisitor<'_> {
    type Value = Vec<Value>;

    fn expecting(&self, formatter: &mut std::fmt::Formatter) -> std::fmt::Result {
        write!(formatter, "a list")
    }

    fn visit_seq<A>(self, mut seq: A) -> Result<Self::Value, A::Error>
    where
        A: serde::de::SeqAccess<'de>,
    {
        let mut agg_map = serde_json::Map::new();

        while let Some(item) = seq.next_element()? {
            // If `item` isn't a JSON object, we'll skip it:
            let Value::Object(map) = item else { continue };

            // Get the first element, assuming we have some
            let (k, v) = match map.into_iter().next() {
                Some(kv) => kv,
                None => continue,
            };

            // Ignore any null values; aggregate everything into a single map
            if v == Value::Null {
                self.0.records_skipped += 1;
                continue;
            } else {
                println!("Keeping {k}={v}");
                agg_map.insert(k, v);
            }
        }
        let values = Value::Object(agg_map);
        println!("Final value is {values}");

        Ok(vec![values])
    }
}

fn main() {
    let fh = std::fs::File::open("input.json").unwrap();
    let buf = std::io::BufReader::new(fh);
    let read = serde_json::de::IoRead::new(buf);

    let mut state = Stats::default();
    let mut deserializer = serde_json::Deserializer::new(read);

    let mydata = MyDataDeserializer { state: &mut state }
        .deserialize(&mut deserializer)
        .unwrap();

    println!("{mydata:?}");
}

This code runs successfully and properly deserializes my input data. The problem is that I can't figure out how to stream the 'documents' array one element at a time. I don't know how to change documents = map.next_value()?; into something that will pass the state down to a DocumentDeserializer. It should maybe use something like:

let d = DocumentDeserializer { state: self.0 }
    .deserialize(&mut map)
    .unwrap();

However, .deserialize expects a serde::Deserializer<'de>, whereas map is a serde::de::MapAccess<'de>.

This entire thing seems excessively verbose, anyway, so I'm open to another approach if this isn't generally accepted or idiomatic. As the OP in the linked question noted, all this boilerplate is off-putting.

user655321

1 Answer

Your question is well researched so I kind of doubt the solution can be this simple, but don't you just want

"documents" => {
    documents = map.next_value_seed(DocumentDeserializer { state: self.0 })?;
}

Playground

(Personally, I wouldn't name these things …Deserializer. …Seed maybe?)

Caesar
  • Thank you, @Caesar, that seems to work just fine. Clearly, I just don't grok the serde API yet. Do you have any suggestions on how this might be simplified? Almost 150LOC to do this seems a bit excessive to me. – user655321 Apr 05 '23 at 14:03
  • If you're willing to go for "terrible hack", you could skip `MyDataVisitor` with some global state… better not. Though I'm not sure 150LoC is too terrible. I've implemented something similar in Java/Jackson before, it's 730 lines even without seeds (in all fairness, that's for a full real-world problem). One thing you might eventually be able to do once you've grokked the API is to propose some container attributes for just passing through some seed to a field's deserialize impl. I could see that being useful [occasionally](https://stackoverflow.com/a/71333006/401059). – Caesar Apr 05 '23 at 23:10
  • One more alternative would be to ditch serde entirely and go for [some](https://lib.rs/crates/json-tools) json token stream reader. I could even imagine implementing a `Deserializer` for that token stream so you can still use serde for the inner values. Benefit of that approach would be that the `Deserializer` is stateless and can be tested independently, maybe even contributed back upstream to the token stream crate (behind a feature flag, of course). – Caesar Apr 06 '23 at 06:43
  • @Caesar, I wrote a JSON library called [Struson](https://crates.io/crates/struson) which might be suitable for this. The code needed for this question is a bit shorter than the serde_json code, but it is more indented / deeply nested. If you or OP are interested in it, I can post it in a separate answer. Though the library is still experimental and performance is not that good yet. – Marcono1234 Aug 25 '23 at 21:56
  • @Marcono1234 Ah, cool, I see you have [serde](https://docs.rs/struson/latest/struson/serde/index.html) integration. I think that's pretty much what I imagined, so I don't think I absolutely need to see this example spelled out. (The only thing I'm a bit worried about is that you mention meagre performance. If one is going through the trouble of implementing a streaming parser, it's likely to munch some quite big JSON objects… Oh well, I guess you're just being modest.) – Caesar Aug 26 '23 at 11:53