How do you efficiently take a parquet file in Rust and iterate over it as a list of structs?

E.g.

struct Reading {
    datetime: chrono::DateTime<chrono::Utc>,
    value: f64,
}

let filename = "readings.parquet";
let readings: Vec<Reading> = ???;

The only approach I've found is to use `parquet::file::reader::{FileReader, SerializedFileReader}`, but this is extremely slow (1M rows/s), slower even than converting the Parquet file to a CSV in Python and then reading the CSV into Rust.

Current attempt:

use parquet::file::reader::{FileReader, SerializedFileReader};
use parquet::record::RowAccessor;
use std::fs::File;
use std::path::Path;

struct Reading {
    datetime: String,
    value: f64,
}

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Timer for the whole read; named `start` so the per-row `time` column does not shadow it.
    let start = std::time::Instant::now();
    let filename = "readings.parquet";
    let file = File::open(Path::new(filename))?;
    let reader = SerializedFileReader::new(file)?;
    let iter = reader.get_row_iter(None)?;

    let mut readings: Vec<Reading> = Vec::new();

    // Columns 0 and 1 hold the date and time strings; column 2 holds the value.
    for record in iter {
        let date: String = record.get_string(0)?.to_string();
        let time: String = record.get_string(1)?.to_string();
        let datetime = format!("{} {}", date, time);
        let last: f64 = record.get_double(2)?;
        let reading = Reading { datetime, value: last };
        readings.push(reading);
    }

    println!("time: {:?}", start.elapsed());
    println!("readings: {:?}", readings.len());

    Ok(())
}
Test
  • How fast do you need it? Are you testing with `--release`? – tadman May 09 '23 at 02:49
  • Hahah yes, I am testing with `--release`, though I know that is a common mistake. It should be able to handle at least 10M / s. (python can do this faster) – Test May 09 '23 at 02:51
  • Can you show the reading code? It's really hard to say with `???` being the only thing we have to work with. – tadman May 09 '23 at 02:55
  • Stackoverflow refuses because it's too long. Let me try to add words. – Test May 09 '23 at 02:58
  • You can show a simple version of it. – tadman May 09 '23 at 02:59
  • It's worth testing what the raw reading speed is, without appending to the `Vec`, and especially without the parsing. – tadman May 09 '23 at 03:00
  • The parsing is part of the problem: some libraries can very rapidly read into a struct (even CSV can read into structs at about 5M/s). – Test May 09 '23 at 03:03
  • I get that, but I'm just wondering if the `get_string()` or `get_double()` calls are punishingly slow here. Parquet can be read with a schema, and that might speed things up, where `get_row_iter()` can be coached. – tadman May 09 '23 at 03:04
  • How can it be read with a schema? – Test May 09 '23 at 03:05
  • You could try `let parquet_metadata = reader.metadata(); let fields = parquet_metadata.file_metadata().schema().get_fields();` and iterate over the fields. Parquet is column-oriented, so reading field by field could be more efficient. – Nikolay Zakirov May 09 '23 at 05:45
  • But the purpose as stated is to iterate over structs—sometimes computation needs to be done involving more than 1 field at a time. – Test May 09 '23 at 06:11
  • Try with: `let file = BufReader::new (File::open(&Path::new(filename))?);` (see [`BufReader`](https://doc.rust-lang.org/stable/std/io/struct.BufReader.html)). – Jmb May 09 '23 at 06:43
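
The comments above pull in two directions: Parquet is column-oriented, so reading column by column may be cheaper, yet the goal is still a `Vec<Reading>`. The two are not incompatible. Below is a minimal sketch, assuming the date, time and value columns have already been decoded into plain vectors (for instance by a columnar reader such as the `parquet` crate's `parquet::arrow::arrow_reader::ParquetRecordBatchReaderBuilder`, which is not shown here). The helper `rows_from_columns` is hypothetical, not part of any library, and the sample values are made up for illustration. Whether this beats the row iterator on a given file is something to measure rather than assume.

```rust
/// Row-shaped struct from the question (datetime kept as a String, as in the attempt).
#[derive(Debug, Clone, PartialEq)]
struct Reading {
    datetime: String,
    value: f64,
}

/// Hypothetical helper: zips three already-decoded columns into row structs.
/// The slices stand in for whatever a columnar reader would hand back.
fn rows_from_columns(dates: &[String], times: &[String], values: &[f64]) -> Vec<Reading> {
    assert!(
        dates.len() == times.len() && times.len() == values.len(),
        "all columns must have the same number of rows"
    );

    // Allocate once up front; the row count is known from the column length.
    let mut readings = Vec::with_capacity(values.len());
    for ((date, time), &value) in dates.iter().zip(times).zip(values) {
        readings.push(Reading {
            datetime: format!("{} {}", date, time),
            value,
        });
    }
    readings
}

fn main() {
    // Made-up sample columns, standing in for data decoded from a Parquet file.
    let dates = vec!["2023-05-09".to_string(), "2023-05-09".to_string()];
    let times = vec!["02:49:00".to_string(), "02:51:00".to_string()];
    let values = vec![1.5, 2.5];

    let readings = rows_from_columns(&dates, &times, &values);
    println!("readings: {}", readings.len());
}
```

Once the rows are assembled this way, any computation that needs more than one field at a time can iterate over `readings` as usual; the columnar step only changes how the data is decoded, not how it is consumed.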

0 Answers