
I have a CSV file with roughly 200+ columns and 1M+ rows. When converting it from CSV to Parquet in Python, I got an error:

    import argparse

    import pandas as pd
    import pyarrow as pa
    import pyarrow.parquet as pq

    chunksize = 100_000
    parquet_file = 'output.parquet'

    parser = argparse.ArgumentParser(description='Process Arguments')
    parser.add_argument("--fname", action="store", default="", help="specify <run/update>")
    args = parser.parse_args()
    csv_file = args.fname  # e.g. 'bigcut.csv'

    csv_stream = pd.read_csv(csv_file, encoding='utf-8', sep=',', chunksize=chunksize, low_memory=False)
    for i, chunk in enumerate(csv_stream):
        print("Chunk", i)
        if i == 0:
            # Infer the Parquet schema from the first chunk only
            parquet_schema = pa.Table.from_pandas(df=chunk).schema
            parquet_writer = pq.ParquetWriter(parquet_file, parquet_schema, compression='snappy')
        table = pa.Table.from_pandas(chunk, schema=parquet_schema)
        parquet_writer.write_table(table)
    parquet_writer.close()

When I run it, it produces the following error:

    File "pyconv.py", line 25, in <module>
table = pa.Table.from_pandas(chunk, schema=parquet_schema)
    File "pyarrow/table.pxi", line 1217, in pyarrow.lib.Table.from_pandas
    File "/home/cloud-user/pydev/py36-venv/lib64/python3.6/site-packages/pyarrow/pandas_compat.py", line 387, in dataframe_to_arrays
convert_types))
    File "/opt/rh/rh-python36/root/usr/lib64/python3.6/concurrent/futures/_base.py", line 586, in result_iterator
yield fs.pop().result()
    File "/opt/rh/rh-python36/root/usr/lib64/python3.6/concurrent/futures/_base.py", line 432, in result
return self.__get_result()
    File "/opt/rh/rh-python36/root/usr/lib64/python3.6/concurrent/futures/_base.py", line 384, in __get_result
raise self._exception
    File "/opt/rh/rh-python36/root/usr/lib64/python3.6/concurrent/futures/thread.py", line 56, in run
result = self.fn(*self.args, **self.kwargs)
    File "/home/cloud-user/pydev/py36-venv/lib64/python3.6/site-packages/pyarrow/pandas_compat.py", line 376, in convert_column
raise e
    File "/home/cloud-user/pydev/py36-venv/lib64/python3.6/site-packages/pyarrow/pandas_compat.py", line 370, in convert_column
return pa.array(col, type=ty, from_pandas=True, safe=safe)
     File "pyarrow/array.pxi", line 169, in pyarrow.lib.array
     File "pyarrow/array.pxi", line 69, in pyarrow.lib._ndarray_to_array
     File "pyarrow/error.pxi", line 81, in pyarrow.lib.check_status
      pyarrow.lib.ArrowInvalid: ("'utf-32-le' codec can't decode bytes in position 0-3: code point not in range(0x110000)", 'Conversion failed for column agent_number__c with type float64')

I am new to pandas/pyarrow/Python; any recommendations on what to do next to debug this would be appreciated.

Yesaya

2 Answers


'utf-32-le' codec can't decode bytes in position 0-3

It looks like a library is trying to decode your data as utf-32-le, whereas you read the CSV data as utf-8.

So you'll somehow have to tell that reader (pyarrow.lib) to read it as utf-8. I don't know Python/Parquet, so I can't provide the exact code, but see the sketch below.
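Given that the error also says "Conversion failed for column agent_number__c with type float64", one common pandas-side workaround is to pin that column's dtype when reading the CSV, so every chunk produces the same Arrow type as the schema inferred from the first chunk. A minimal sketch (untested; the column name is taken from the error message):

    import pandas as pd

    # Force the problematic column to stay a string in every chunk, so it
    # cannot come out as float64 in one chunk and text in another.
    csv_stream = pd.read_csv(
        'bigcut.csv',
        encoding='utf-8',
        chunksize=100_000,
        dtype={'agent_number__c': str},
        low_memory=False,
    )

The rest of the chunked Parquet-writing loop from the question can stay the same.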

Danny_ds

The CSV has around 3 million records. I managed to catch one potential problem.

One of the columns holds string/text data. Most of its values are numeric, e.g. 1000, 230, 400, but a few were entered with a suffix, like 5k, 100k, 29k.

So the conversion, which tried to treat the whole column as numeric, did not like those values.

Can you advise?
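A quick way to confirm and list the offending values (a sketch, assuming the column is agent_number__c as named in the error and the file is the bigcut.csv from the question):

    import pandas as pd

    # Read only the suspect column as text, then try to parse it as numbers;
    # values that fail to parse (e.g. '5k', '100k', '29k') become NaN and can be listed.
    col = pd.read_csv('bigcut.csv', usecols=['agent_number__c'], dtype=str)['agent_number__c']
    as_num = pd.to_numeric(col, errors='coerce')
    bad = col[as_num.isna() & col.notna()]
    print(bad.unique()[:20])

From there you can either clean those values (e.g. expand the "k" suffix into thousands) or simply read the column as a string, as sketched in the first answer.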

Yesaya