
I have a CSV file with roughly 200+ columns and 1M+ rows. When converting it from CSV to Parquet in Python, I got an error:

    import argparse

    import pandas as pd
    import pyarrow as pa
    import pyarrow.parquet as pq

    chunksize = 100_000
    parquet_file = 'output.parquet'

    parser = argparse.ArgumentParser(description='Process Arguments')
    parser.add_argument("--fname", action="store", default="", help="specify <run/update>")
    args = parser.parse_args()
    csv_file = args.fname  # e.g. 'bigcut.csv'

    csv_stream = pd.read_csv(csv_file, encoding='utf-8', sep=',', chunksize=chunksize, low_memory=False)
    for i, chunk in enumerate(csv_stream):
        print("Chunk", i)
        if i == 0:
            # Infer the Parquet schema from the first chunk only
            parquet_schema = pa.Table.from_pandas(df=chunk).schema
            parquet_writer = pq.ParquetWriter(parquet_file, parquet_schema, compression='snappy')
        table = pa.Table.from_pandas(chunk, schema=parquet_schema)
        parquet_writer.write_table(table)
    parquet_writer.close()

When I run it, it produces the following error:

    File "pyconv.py", line 25, in <module>
table = pa.Table.from_pandas(chunk, schema=parquet_schema)
    File "pyarrow/table.pxi", line 1217, in pyarrow.lib.Table.from_pandas
    File "/home/cloud-user/pydev/py36-venv/lib64/python3.6/site-packages/pyarrow/pandas_compat.py", line 387, in dataframe_to_arrays
convert_types))
    File "/opt/rh/rh-python36/root/usr/lib64/python3.6/concurrent/futures/_base.py", line 586, in result_iterator
yield fs.pop().result()
    File "/opt/rh/rh-python36/root/usr/lib64/python3.6/concurrent/futures/_base.py", line 432, in result
return self.__get_result()
    File "/opt/rh/rh-python36/root/usr/lib64/python3.6/concurrent/futures/_base.py", line 384, in __get_result
raise self._exception
    File "/opt/rh/rh-python36/root/usr/lib64/python3.6/concurrent/futures/thread.py", line 56, in run
result = self.fn(*self.args, **self.kwargs)
    File "/home/cloud-user/pydev/py36-venv/lib64/python3.6/site-packages/pyarrow/pandas_compat.py", line 376, in convert_column
raise e
    File "/home/cloud-user/pydev/py36-venv/lib64/python3.6/site-packages/pyarrow/pandas_compat.py", line 370, in convert_column
return pa.array(col, type=ty, from_pandas=True, safe=safe)
     File "pyarrow/array.pxi", line 169, in pyarrow.lib.array
     File "pyarrow/array.pxi", line 69, in pyarrow.lib._ndarray_to_array
     File "pyarrow/error.pxi", line 81, in pyarrow.lib.check_status
      pyarrow.lib.ArrowInvalid: ("'utf-32-le' codec can't decode bytes in position 0-3: code point not in range(0x110000)", 'Conversion failed for column agent_number__c with type float64')

I am new to pandas/pyarrow/Python; any recommendations on what to do next to debug this would be appreciated.

Yesaya

2 Answers


'utf-32-le' codec can't decode bytes in position 0-3

It looks like a library is trying to decode your data as utf-32-le, whereas you read the CSV data as utf-8.

So you'll somehow have to tell that reader (pyarrow.lib) to read it as utf-8. I don't know Python/Parquet, so I can't provide the exact code, but see the sketch below.
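Given that the error also says "Conversion failed for column agent_number__c with type float64", one common pandas-side workaround is to pin that column's dtype when reading the CSV, so every chunk produces the same Arrow type as the schema inferred from the first chunk. A minimal sketch (untested; the column name is taken from the error message):

    import pandas as pd

    # Force the problematic column to stay a string in every chunk, so it
    # cannot come out as float64 in one chunk and text in another.
    csv_stream = pd.read_csv(
        'bigcut.csv',
        encoding='utf-8',
        chunksize=100_000,
        dtype={'agent_number__c': str},
        low_memory=False,
    )

The rest of the chunked Parquet-writing loop from the question can stay the same.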

Danny_ds

The CSV has around 3 million records. I managed to catch one potential problem.

One of the columns holds string/text data. Most of its values are numeric, e.g. 1000, 230, 400, but a few were entered with a suffix, like 5k, 100k, 29k.

So the conversion, which tried to treat the whole column as numeric, did not like those values.

Can you advise?
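A quick way to confirm and list the offending values (a sketch, assuming the column is agent_number__c as named in the error and the file is the bigcut.csv from the question):

    import pandas as pd

    # Read only the suspect column as text, then try to parse it as numbers;
    # values that fail to parse (e.g. '5k', '100k', '29k') become NaN and can be listed.
    col = pd.read_csv('bigcut.csv', usecols=['agent_number__c'], dtype=str)['agent_number__c']
    as_num = pd.to_numeric(col, errors='coerce')
    bad = col[as_num.isna() & col.notna()]
    print(bad.unique()[:20])

From there you can either clean those values (e.g. expand the "k" suffix into thousands) or simply read the column as a string, as sketched in the first answer.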

Yesaya