7

I am converting pandas dataframe to polars dataframe but pyarrow throws error.

My code:

import polars as pl
import pandas as pd

if __name__ == "__main__":

    with open(r"test.xlsx", "rb") as f:
        excelfile = f.read()
    excelfile = pd.ExcelFile(excelfile)
    sheetnames = excelfile.sheet_names
    df = pd.concat(
        [
            pd.read_excel(
            excelfile, sheet_name=x, header=0)
                    for x in sheetnames
                    ], axis=0)

    df_pl = pl.from_pandas(df)

Error:

File "pyarrow\array.pxi", line 312, in pyarrow.lib.array

File "pyarrow\array.pxi", line 83, in pyarrow.lib._ndarray_to_array

File "pyarrow\error.pxi", line 122, in pyarrow.lib.check_status

pyarrow.lib.ArrowTypeError: Expected bytes, got a 'int' object

I tried changing pandas dataframe dtype to str and problem is solved, but i don't want to change dtypes. Is it bug in pyarrow or am I missing something?

Rahil
  • 183
  • 1
  • 2
  • 11

2 Answers2

4

Edit: Polars 0.13.42 and later

Polars now has a read_excel function that will correctly handle this situation. read_excel is now the preferred way to read Excel files into Polars.

Note: to use read_excel, you will need to install xlsx2csv (which can be installed with pip).

Polars: prior to 0.13.42

I can replicate this result. It is due to a column in the original Excel file that contains both text and numbers.

For example, create a new Excel file with one column in which you type both numbers and text, save it, and run your code on that file. I get the following traceback:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/xxx/.virtualenvs/StackOverflow3.10/lib/python3.10/site-packages/polars/convert.py", line 299, in from_pandas
    return DataFrame._from_pandas(df, rechunk=rechunk, nan_to_none=nan_to_none)
  File "/home/xxx/.virtualenvs/StackOverflow3.10/lib/python3.10/site-packages/polars/internals/frame.py", line 454, in _from_pandas
    pandas_to_pydf(
  File "/home/xxx/.virtualenvs/StackOverflow3.10/lib/python3.10/site-packages/polars/internals/construction.py", line 485, in pandas_to_pydf
    arrow_dict = {
  File "/home/xxx/.virtualenvs/StackOverflow3.10/lib/python3.10/site-packages/polars/internals/construction.py", line 486, in <dictcomp>
    str(col): _pandas_series_to_arrow(
  File "/home/xxx/.virtualenvs/StackOverflow3.10/lib/python3.10/site-packages/polars/internals/construction.py", line 237, in _pandas_series_to_arrow
    return pa.array(values, pa.large_utf8(), from_pandas=nan_to_none)
  File "pyarrow/array.pxi", line 312, in pyarrow.lib.array
  File "pyarrow/array.pxi", line 83, in pyarrow.lib._ndarray_to_array
  File "pyarrow/error.pxi", line 122, in pyarrow.lib.check_status
pyarrow.lib.ArrowTypeError: Expected bytes, got a 'int' object

There are several lengthy discussions on this issue, such as these:

This particular comment might be relevant, as you are concatenating the results of parsing multiple sheets in an Excel file. This may lead to conflicting dtypes for a column: https://github.com/pandas-dev/pandas/issues/21228#issuecomment-419175116

How to approach this depends on your data and its use, so I can't recommend a blanket solution (i.e., fixing your source Excel file, or changing the dtype to str).

  • Thanks, your analysis is perfect. I have one column with mix numbers and string. I was trying to find some direct resolution but i found workaround . I have posted my answer below. – Rahil Mar 30 '22 at 05:23
  • Just curious on license of polars. Isn't xlsx2csv using the GPLv2 license? Can polars use the MIT license while it has dependencies licensed in GPLv2? – Benjamin Du Jun 06 '22 at 19:43
  • Thank you for pointing this out. The author of xlsx2csv has changed the license to MIT. https://github.com/dilshod/xlsx2csv –  Jun 07 '22 at 12:26
1

My problem is solved by saving pandas dataframe to 'csv' format and then importing 'csv' file in polars.

import os
import polars as pl
import pandas as pd

if __name__ == "__main__":

    with open(r"test.xlsx", "rb") as f:
        excelfile = f.read()
    excelfile = pd.ExcelFile(excelfile)
    sheetnames = excelfile.sheet_names
    df = pd.concat([pd.read_excel(excelfile, sheet_name=x, header=0) 
                    for x in sheetnames 
                    ], axis=0)
    df.to_csv("temp.csv",index=False)
    df_pl = pl.scan_csv("temp.csv")
    os.remove("temp.csv")
Rahil
  • 183
  • 1
  • 2
  • 11
  • 1
    This method wouldn't work with large sets, as writing to and reading from a CSV in either virtual machines or local machines would create a bottleneck. – Bluebird Mar 29 '23 at 21:24