
I am using polars with Rust and I would like to be able to read multiple csv files as input.

I found this section in the documentation that shows how to use glob patterns to read multiple files using Python, but I could not find a way to do this in Rust.

Trying the glob pattern with Rust does not work.

The code I tried was

use polars::prelude::*;

fn main() {

    let df = CsvReader::from_path("./example/*.csv").unwrap().finish().unwrap();

    println!("{:?}", df);
}

And this failed with the error

thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: Io(Os { code: 2, kind: NotFound, message: "No such file or directory" })', src/main.rs:26:54
stack backtrace:
   0: rust_begin_unwind

I also tried creating the path separately and confirming that it points to a directory:

use std::path::PathBuf;
use polars::prelude::*;

fn main() {

    let path = PathBuf::from("./example");
    println!("{}", path.is_dir());
    let df = CsvReader::from_path(path).unwrap().finish().unwrap();

    println!("{:?}", df);
}

It also fails with the same error.

So the question is: how do I read multiple CSV, Parquet, JSON, etc. files from a directory using Rust?

Finlay Weber
  • What do you want to do with each CSV file once it is loaded? `CsvReader::from_path` takes a value that will be converted into a `std::path::PathBuf`, which represents a single file. Can you use the standard library to get a list of files in your target directory and process them in a loop? – Jimmy Jan 15 '23 at 09:57
  • Then I'd rather start writing my own dataframe library. The idea is to create a single dataframe from the contents of the files in the directory. Having to manually process the contents defeats the utility of the library. This feature is supported by DataFusion, another library in this space. – Finlay Weber Jan 15 '23 at 10:07

1 Answer


The section of the documentation referenced in your question uses both the `glob` library and a `for` loop in Python.

The same idea carries over to Rust with the `glob` crate and a loop over the matched paths:

Eager version

use std::path::PathBuf;

use glob::glob;
use polars::prelude::*;

fn main() {
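    // glob() returns Err only when the pattern itself is invalid,
    // not when no files match.
    let csv_files = glob("my-file-path/*.csv")
        .expect("Invalid glob pattern");

    let mut dfs: Vec<PolarsResult<DataFrame>> = Vec::new();

    // Each entry is a Result<PathBuf, GlobError>; unwrap() panics on unreadable paths.
    for entry in csv_files {
        dfs.push(read_csv(entry.unwrap()));
    }

    println!("dfs: {:?}", dfs);
}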

fn read_csv(filepath: PathBuf) -> PolarsResult<DataFrame> {
    CsvReader::from_path(filepath)?
        .has_header(true)
        .finish()
}

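Lazy version

use std::path::PathBuf;

use glob::glob;
use polars::prelude::*;

fn read_csv_lazy(filepath: PathBuf) -> PolarsResult<LazyFrame> {
    LazyCsvReader::new(filepath).has_header(true).finish()
}

fn main() {
    // glob() returns Err only when the pattern itself is invalid.
    let csv_files = glob("my-file-path/*.csv").expect("Invalid glob pattern");

    let mut ldfs: Vec<PolarsResult<LazyFrame>> = Vec::new();

    for entry in csv_files {
        ldfs.push(read_csv_lazy(entry.unwrap()));
    }

    // do stuff

    // collect() executes each lazy query and materializes a DataFrame.
    for f in ldfs {
        println!("{:?}", f.unwrap().collect())
    }
}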
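
If pulling in the `glob` crate is not desirable, the file list can also be built with the standard library alone, which matches the second attempt in the question where a directory path is passed. The following is a sketch under that assumption, swapping glob matching for `std::fs::read_dir` plus an extension check. The function name `csv_files_in` is hypothetical.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Hypothetical helper: returns the files directly inside `dir` whose
/// extension is `csv` (case-insensitive), sorted by path.
/// Subdirectories are not searched.
fn csv_files_in(dir: &Path) -> io::Result<Vec<PathBuf>> {
    let mut paths = Vec::new();
    for entry in fs::read_dir(dir)? {
        let path = entry?.path();
        let is_csv = path
            .extension()
            .map_or(false, |ext| ext.eq_ignore_ascii_case("csv"));
        if path.is_file() && is_csv {
            paths.push(path);
        }
    }
    // read_dir makes no ordering guarantee, so sort for a stable result.
    paths.sort();
    Ok(paths)
}
```

Each returned `PathBuf` can be passed to `read_csv` or `read_csv_lazy` in place of the entries produced by `glob`.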
Pasqui