I'm trying to do some stateful JSON parsing with serde and serde_json. I started by checking out "How to pass options to Rust's serde that can be accessed in Deserialize::deserialize()?", and while I nearly have what I need, I seem to be missing something crucial.
What I'm trying to do is two-fold:
- My JSON is extremely large -- far too big to just read the input into memory -- so I need to stream it. (FWIW, it also has many nested layers, so I need to use disable_recursion_limit.)
- I need some stateful processing, where I can pass some data to the deserializer that will affect what data is kept from the input JSON and how it might be transformed during deserialization.
For example, my input might look like:
{ "documents": [
    { "foo": 1 },
    { "baz": true },
    { "bar": null }
  ],
  "journal": { "timestamp": "2023-04-04T08:28:00" }
}
Here, each object within the 'documents' array is very large, and I only need a subset of them. Unfortunately, I need to first find the key-value pair "documents", then I need to visit each element in that array. For now, I don't care about other key-value pairs (such as "journal"), but that might change.
My current approach is as follows:
use serde::de::DeserializeSeed;
use serde_json::Value;
/// A simplified state passed to and returned from the deserialization.
#[derive(Debug, Default)]
struct Stats {
    records_skipped: usize,
}
/// Models the input data; `Documents` is just a vector of JSON values,
/// but it is its own type to allow custom deserialization
#[derive(Debug)]
struct MyData {
    documents: Vec<Value>,
    journal: Value,
}
struct MyDataDeserializer<'a> {
    state: &'a mut Stats,
}
/// Top-level seeded deserializer only so I can plumb the state through
impl<'de> DeserializeSeed<'de> for MyDataDeserializer<'_> {
    type Value = MyData;
    fn deserialize<D>(mut self, deserializer: D) -> Result<Self::Value, D::Error>
    where
        D: serde::Deserializer<'de>,
    {
        let visitor = MyDataVisitor(&mut self.state);
        let docs = deserializer.deserialize_map(visitor)?;
        Ok(docs)
    }
}
struct MyDataVisitor<'a>(&'a mut Stats);
impl<'de> serde::de::Visitor<'de> for MyDataVisitor<'_> {
    type Value = MyData;
    fn expecting(&self, formatter: &mut std::fmt::Formatter) -> std::fmt::Result {
        write!(formatter, "a map")
    }
    fn visit_map<A>(self, mut map: A) -> Result<Self::Value, A::Error>
    where
        A: serde::de::MapAccess<'de>,
    {
        let mut documents = Vec::new();
        let mut journal = Value::Null;
        while let Some(key) = map.next_key::<String>()? {
            println!("Got key = {key}");
            match &key[..] {
                "documents" => {
                    // Not sure how to handle the next value in a streaming manner
                    documents = map.next_value()?;
                }
                "journal" => journal = map.next_value()?,
                _ => panic!("Unexpected key '{key}'"),
            }
        }
        Ok(MyData { documents, journal })
    }
}
struct DocumentDeserializer<'a> {
    state: &'a mut Stats,
}
impl<'de> DeserializeSeed<'de> for DocumentDeserializer<'_> {
    type Value = Vec<Value>;
    fn deserialize<D>(mut self, deserializer: D) -> Result<Self::Value, D::Error>
    where
        D: serde::Deserializer<'de>,
    {
        let visitor = DocumentVisitor(&mut self.state);
        let documents = deserializer.deserialize_seq(visitor)?;
        Ok(documents)
    }
}
struct DocumentVisitor<'a>(&'a mut Stats);
impl<'de> serde::de::Visitor<'de> for DocumentVisitor<'_> {
    type Value = Vec<Value>;
    fn expecting(&self, formatter: &mut std::fmt::Formatter) -> std::fmt::Result {
        write!(formatter, "a list")
    }
    fn visit_seq<A>(self, mut seq: A) -> Result<Self::Value, A::Error>
    where
        A: serde::de::SeqAccess<'de>,
    {
        let mut agg_map = serde_json::Map::new();
        while let Some(item) = seq.next_element()? {
            // If `item` isn't a JSON object, we'll skip it:
            let Value::Object(map) = item else { continue };
            // Get the first element, assuming we have some
            let (k, v) = match map.into_iter().next() {
                Some(kv) => kv,
                None => continue,
            };
            // Ignore any null values; aggregate everything into a single map
            if v == Value::Null {
                self.0.records_skipped += 1;
                continue;
            } else {
                println!("Keeping {k}={v}");
                agg_map.insert(k, v);
            }
        }
        let values = Value::Object(agg_map);
        println!("Final value is {values}");
        Ok(vec![values])
    }
}
fn main() {
    let fh = std::fs::File::open("input.json").unwrap();
    let buf = std::io::BufReader::new(fh);
    let read = serde_json::de::IoRead::new(buf);
    let mut state = Stats::default();
    let mut deserializer = serde_json::Deserializer::new(read);
    let mydata = MyDataDeserializer { state: &mut state }
        .deserialize(&mut deserializer)
        .unwrap();
    println!("{mydata:?}");
}
This code runs successfully and properly deserializes my input data. The problem is that I can't figure out how to stream the 'documents' array one element at a time. I don't know how to change documents = map.next_value()?; into something that will pass the state down to a DocumentDeserializer. It should maybe use something like:
let d = DocumentDeserializer { state: self.0 }
    .deserialize(&mut map)
    .unwrap();
However, .deserialize expects a serde::Deserializer<'de>, whereas map is a serde::de::MapAccess<'de>.
This entire thing seems excessively verbose, anyway, so I'm open to another approach if this isn't generally accepted or idiomatic. As the OP in the linked question noted, all this boilerplate is off-putting.