Lazily reading a parquet file with binary datatype in PyPolars

Question

I hope this is a good question; if I should post it as an issue on the PyPolars GitHub instead, please let me know.

I have a quite large parquet file where some columns contain binary data.

These columns are not interesting to me right now, so I don't mind that PyPolars does not yet support the Binary datatype (at least that is how I understand it; my question would be irrelevant otherwise). However, I would like to make full use of the query optimization by lazily reading the file with .scan_parquet() instead of .read_parquet().

Currently .scan_parquet() gives me the following error:

pyo3_runtime.PanicException: Arrow datatype Binary not supported by Polars. You probably need to activate that data-type feature.

I don't know of a way to 'activate that data-type feature'.

So my workaround is to use .read_parquet() and specify in advance which columns I want to use so that it never attempts to read the Binary ones.

The problem is that I am doing exploratory data analysis and there is a large number of columns. For one, it is annoying to have to specify a long list of columns (basically ~150 minus the two that cause the issue). It is also inefficient to read all of these columns every time when I only need a small subset, and maintaining a short list by hand is even more annoying, because I have to change it each time I, for example, add a filter.
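One way to avoid typing out the column list by hand might be to derive it from the file's schema. The following is only a minimal sketch, assuming pyarrow is installed alongside Polars; the helper names non_binary_columns and read_without_binary are hypothetical, and the type check relies on pyarrow's string names for its binary types.

```python
from typing import Iterable, List, Tuple


def non_binary_columns(fields: Iterable[Tuple[str, str]]) -> List[str]:
    """Return the names of all columns whose Arrow type is not a binary type.

    `fields` is an iterable of (column_name, arrow_type_as_string) pairs,
    for example ("payload", "binary") or ("id", "int64").
    """
    # String forms pyarrow uses for its binary types; fixed-size binary
    # carries a width suffix such as "fixed_size_binary[16]".
    binary_prefixes = ("binary", "large_binary", "fixed_size_binary")
    return [name for name, dtype in fields if not dtype.startswith(binary_prefixes)]


def read_without_binary(path: str):
    """Eagerly read `path` with Polars, skipping every binary column.

    Assumption: both polars and pyarrow are installed. They are imported
    here so that the pure helper above works without them.
    """
    import polars as pl
    import pyarrow.parquet as pq

    # read_schema inspects the file metadata rather than the column data.
    schema = pq.read_schema(path)
    fields = [(name, str(dtype)) for name, dtype in zip(schema.names, schema.types)]
    return pl.read_parquet(path, columns=non_binary_columns(fields))
```

This still reads eagerly, so it only removes the need to maintain the list; it does not give me the projection pushdown I am hoping to get from a lazy scan.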

It would be ideal if I could use .scan_parquet() and let the query optimizer figure out that it only needs to read the (unproblematic) columns that I actually need.

Is there a better way of doing things that I am not seeing?

That datatype is not yet supported. But we almost support it. There is a PR for that binary type ready to be merged in the coming week. — ritchie46, Oct 05 '22 at 15:59

