
I'm trying to return a list of dictionaries (I'm coming from a Python background) from this Rust function, in which I read a CSV file using the polars library. I think the data type I need to use is Vec<Vec> in this case; if not, please correct me.

I've written the following function,

use std::fs::File;

use polars::prelude::*;

fn read_csv_file(path: &str) -> Vec<Vec<AnyValue>> {
    let file = File::open(path).expect("could not open file");
    let df = CsvReader::new(file)
        .infer_schema(None)
        .has_header(true)
        .finish()
        .unwrap();

    let df_height = df.height();
    // Get all the rows from dataframe
    let mut i = 0;
    let mut rows = Vec::new();
    while i < df_height {
        let row = df.get(i).unwrap();
        rows.push(row.to_owned());
        i += 1;
    }

    return rows;
}

but when I try to compile it, I get:

error[E0515]: cannot return value referencing local variable `df`
  --> src/main.rs:50:12
   |
40 |         let row = df.get(i).unwrap();
   |                   --------- `df` is borrowed here
...
50 |     return rows;
   |            ^^^^ returns a value referencing data owned by the current function

For more information about this error, try `rustc --explain E0515`.

I tried adding .to_owned() to various parts of the function, but no luck :). Stack Overflow answers usually give examples related to borrowed values, but I'm not exactly sure what is borrowed here (the error says df, but the row shouldn't be a reference to df at this point).

I'm a bit lost and looking for some help understanding what's going on with my function.

gkaykck
  • The things inside a row have a lifetime tied to the DF they're from; they're `AnyValue<'a>`s, where `'a` is the lifetime of the borrow of `self` from the `df.get` call. This whole function seems a little odd though. A DataFrame is basically nothing but a `Vec>` only much more performant and ergonomic. Why would you want to do this instead of just returning and working with a DF? – isaactfa Aug 27 '22 at 17:11
  • @isaactfa I'm trying to convert the data structure from columnar to row-based due to requirements from the javascript library I'm using on front end. – gkaykck Aug 27 '22 at 17:13
  • I would imagine there has to be a more lightweight solution for that than going through a DF solely to parse a .csv file, no? Can't you use any other csv parsing utility to do this? – isaactfa Aug 27 '22 at 17:18
  • This is not the entire implementation, I'm planning to do some data querying within the dataframe – gkaykck Aug 27 '22 at 17:24
  • You could try using `into_static` on the `AnyValue`s but that only works if they don't borrow any data from the DF. DataFrame rows are just not meant to be easily manipulated. – isaactfa Aug 27 '22 at 18:00

1 Answer


Level-set; we understand the borrow checker

To level-set: owning a value that itself holds a borrowed value does not give you full ownership of what you think you own. The outer value is yours, but it is only valid for as long as the borrow inside it, so you are still relying on a borrow.

Moving ownership from df to row

The idea of moving ownership from the DataFrame to rows: Vec<Vec<AnyValue>> seems reasonable enough. However, to_owned() and clone() do not work as expected here. At one level, this could be considered a bug. However, I suspect the real problem is that the API gives the impression this would be possible in the first place. I even tried std::mem::replace to take the Vec<AnyValue<'_>> out of the Option (see Option::take) and still got the same error.

The answer lies in the AnyValue<'_>

Looking again at AnyValue<'_> in the function's return position, we see that we get "something" that is borrowed. We know this from the generic lifetime parameter in the type. So, in my estimation, what is happening has less to do with df itself and more to do with the fact that AnyValue<'_> holds a borrowed value that we can never take ownership of.

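Calling to_owned() clones the Vec, but the clone is still a Vec<AnyValue<'_>> with the same lifetime parameter, so it still depends on the borrow of df. You own the container, not what it points to. That is the limit of the DataFrame in a strongly typed language.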
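
To make the lifetime point concrete, here is a minimal std-only sketch. None of these names come from polars: `CellRef<'a>`, `CellOwned`, `Table` and `owned_rows` are hypothetical stand-ins, with `CellRef<'a>` playing the role of `AnyValue<'a>` and `Table::get` the role of `df.get`. Cloning a `Vec<CellRef<'a>>` would only give another `Vec<CellRef<'a>>`; the tie to the table is broken only by converting each cell into a type that has no lifetime parameter.

```rust
// Hypothetical stand-in for AnyValue<'a>: a cell that may borrow from its table.
#[derive(Debug, Clone, PartialEq)]
enum CellRef<'a> {
    Int(i64),
    Str(&'a str),
}

// Owned counterpart with no lifetime parameter, so it can outlive the table.
#[derive(Debug, Clone, PartialEq)]
enum CellOwned {
    Int(i64),
    Str(String),
}

impl<'a> CellRef<'a> {
    // Copy the borrowed data into an owned value.
    fn into_owned(self) -> CellOwned {
        match self {
            CellRef::Int(n) => CellOwned::Int(n),
            CellRef::Str(s) => CellOwned::Str(s.to_string()),
        }
    }
}

// Hypothetical stand-in for a data frame with two columns.
struct Table {
    ids: Vec<i64>,
    names: Vec<String>,
}

impl Table {
    // Like df.get(i): the returned cells borrow from `self`.
    fn get(&self, i: usize) -> Option<Vec<CellRef<'_>>> {
        Some(vec![
            CellRef::Int(*self.ids.get(i)?),
            CellRef::Str(self.names.get(i)?.as_str()),
        ])
    }
}

// Returning Vec<Vec<CellRef<'_>>> from here would fail with E0515,
// because `table` is dropped when the function returns.
// Returning owned cells compiles.
fn owned_rows(table: Table) -> Vec<Vec<CellOwned>> {
    let rows: Vec<Vec<CellOwned>> = (0..table.ids.len())
        .filter_map(|i| table.get(i))
        .map(|row| row.into_iter().map(CellRef::into_owned).collect())
        .collect();
    rows
}
```

Calling `owned_rows` on a `Table` with ids `[1, 2]` and names `["a", "b"]` gives two rows, the first being `[CellOwned::Int(1), CellOwned::Str("a".to_string())]`.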

The design intent - columns, not rows

DataFrame's power comes from organizing data by columns, not rows. So if you want rows, this may not be the best choice of data structure in the first place (SQLite perhaps?). The underlying data structure, an Apache Arrow array, is a singly typed array with extra powers (this is where polars can no longer be compared to Python's pandas; a good thing). It's not a multi-typed structure per se. A row, on the other hand, has to be "truly" multi-typed, so it is more like a tuple in Rust.

In closing

All in all, there is surely a way to move ownership from a DataFrame to something organized by row, but that would require a different data structure. The Vec<AnyValue<'_>> ain't it.

If you know the types of each column ahead of time, you may be able to take ownership one cell at a time (copy or clone) at the chunked array level of the structure. The hosting struct could be a tuple or Vec that hosts an enum or trait object. I like a tuple here because you have a fixed width (fixed columns) so don't need the flexibility of a Vec.
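
As a rough sketch of that last idea, the snippet below uses plain `Vec`s as stand-ins for typed columns. It makes no polars calls, and `Columns`, `Row` and `into_rows` are hypothetical names chosen for illustration. Each cell is moved out of its column, and every row comes out as a fixed-width tuple that owns its data.

```rust
// Hypothetical stand-in for a frame whose column types are known up front.
struct Columns {
    id: Vec<i64>,
    name: Vec<String>,
    score: Vec<Option<f64>>, // None models a missing value
}

// One row as a fixed-width tuple: (id, name, score).
type Row = (i64, String, Option<f64>);

// Consume the columns and zip them into owned rows.
// zip stops at the shortest column, so equal lengths are assumed.
fn into_rows(cols: Columns) -> Vec<Row> {
    cols.id
        .into_iter()
        .zip(cols.name)
        .zip(cols.score)
        .map(|((id, name), score)| (id, name, score))
        .collect()
}
```

Because `into_rows` takes `Columns` by value, nothing in the result borrows from the source, so the rows can be returned from a function freely. If the column types are not known ahead of time, an enum per cell inside a `Vec` is the more flexible variant of the same approach.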

Edmund's Echo