I got the error below when writing a dask dataframe to S3 and couldn't figure out why. Does anybody know how to fix it?

dd.from_pandas(pred, npartitions=npart).to_parquet(out_path)

The error is

error.. Error converting column "team_nm" to bytes using encoding UTF8. Original error: bad argument type for built-in operation
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/fastparquet/writer.py", line 175, in convert
    out = array_encode_utf8(data)
  File "fastparquet/speedups.pyx", line 60, in fastparquet.speedups.array_encode_utf8
TypeError: bad argument type for built-in operation

During handling of the above exception, another exception occurred:

I tried encoding "team_nm" as "latin-1" before writing to parquet, but that doesn't work.

pred['team_nm'] = pred['team_nm'].str.encode("Latin-1")

I also tried upgrading fastparquet from 0.4.1 to 0.7.1, but that doesn't work either.

Justin Shan
  • `python3.7` ? That's a bit .... old. The current version is 3.11. The error doesn't complain about UTF8, it complains about the contents of the field. Are you *sure* it contains text? If not, no encoding will work. If you google the error `bad argument type for built-in operation` you'll see it's often returned when trying to treat non-text data (eg Paths, lists) as strings. – Panagiotis Kanavos Aug 29 '23 at 17:02
  • What does `pred.dtypes` return? – Panagiotis Kanavos Aug 29 '23 at 17:03
  • object. I am sure "team_nm" is string – Justin Shan Aug 29 '23 at 19:55
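
One way to check the suggestion in the comments is to count the Python types actually stored in the column, since an `object` dtype says nothing about what each cell holds. This is a minimal sketch with pandas only; the sample frame and the helper `is_text_or_missing` are hypothetical stand-ins, not the asker's data.

```python
import pandas as pd

# Hypothetical stand-in for the asker's `pred` frame: an object column
# that looks like text but also holds a list and an int.
pred = pd.DataFrame({"team_nm": ["Leeds", None, ["Ajax", "PSV"], 7]})


def is_text_or_missing(value):
    """True for str, None, or float NaN; False for anything else."""
    if value is None or isinstance(value, str):
        return True
    return isinstance(value, float) and value != value  # NaN check


# Count how many cells of each Python type the column contains.
type_counts = pred["team_nm"].map(lambda v: type(v).__name__).value_counts()

# Rows whose value is neither text nor missing; these are the ones a
# UTF8 encoder would be expected to reject.
non_text = pred[~pred["team_nm"].map(is_text_or_missing)]

print(type_counts)
print(non_text)
```

If `non_text` is empty on the real data, the column holds only strings and missing values, and the cause lies elsewhere.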

1 Answer

Parquet assumes UTF8 encoding for text and no other encoding is possible, so if your text is something else, it will fail. If you encode your column to bytes yourself, you can indeed choose any encoding you like, so long as whatever loads the data is prepared to do the decoding manually too.

If you have a column of bytes (because you encoded manually), then fastparquet will generally be able to guess this unless your column starts with some NULL/None values. To help it along, you can use the argument object_encoding='bytes' (all object columns to be interpreted as bytes) or object_encoding={'team_nm': 'bytes'} (the one specific column if known to be bytes).
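
A minimal sketch of the per-column form, assuming the column was encoded manually as in the question. The helper names `bytes_object_encoding` and `write_with_hints` and the sample frame are hypothetical; `fastparquet.write` is called directly here rather than through dask, and columns that are not bytes are left as 'infer'.

```python
import pandas as pd


def bytes_object_encoding(df):
    """Build an object_encoding dict: 'bytes' for object columns whose
    non-null values are all bytes, 'infer' for other object columns."""
    encoding = {}
    for col in df.select_dtypes(include="object").columns:
        values = df[col].dropna()
        all_bytes = len(values) > 0 and all(isinstance(v, bytes) for v in values)
        encoding[col] = "bytes" if all_bytes else "infer"
    return encoding


def write_with_hints(df, path):
    # Imported lazily so the helper above can be used without fastparquet.
    import fastparquet

    fastparquet.write(path, df, object_encoding=bytes_object_encoding(df))


# Hypothetical frame whose text column starts with a missing value,
# the case where guessing the column type is least reliable.
pred = pd.DataFrame({"team_nm": [None, "Málaga", "Leeds"], "score": [1, 2, 3]})
pred["team_nm"] = pred["team_nm"].str.encode("latin-1")

hints = bytes_object_encoding(pred)
print(hints)
```

Whatever reads the file back gets raw bytes for such a column and has to decode them itself, for example with `.str.decode("latin-1")`.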

mdurant