
I am learning Apache Arrow for R and have run into the following issue. My dataset has more than 85 million rows, which makes Arrow really useful. I follow these very simple steps.

  1. Open the existing dataset as an Arrow dataset
Sales_data <- open_dataset("Sales_Arrow", format = "csv")
Sales_data

The result is:

FileSystemDataset with 4 csv files
SBLOC: string
Cust-Loc: string
Cust-Item-Loc: string
SBPHYP: int64
SBINV: int64
Cust(Child)Entity: string
SBCUST: string
SBITEM: string
SBTYPE: string
Qty: double
SBPRIC: double
SBICST: double
Unit_Cost_Net: double
SBINDT: date32[day]
SASHIP: string
Entity: int64
ParentCustID: string
ParentCustName: string
Customer-ShipID-Loc: string
Pred_Entity_Loc: string
Cust(Child)-Entity: string
Item-Entity: string

Right after that, I write the dataset to disk as partitioned Arrow data:

write_dataset(Sales_data, "Sales All Partitioned", partitioning = c("Entity", "SBPHYP"))

and get the following error:

Error: Invalid: In CSV column #4: Row #444155: CSV conversion error to int64: invalid value '5e+06'

I checked the value in Sales_data[444155, 4]. It is 201901, exactly the same as in several previous and following rows.

Please help me understand what is going on and how to fix this issue.

grislepak

1 Answer


This seems to be related to ARROW-17241: integers saved in scientific notation (such as 5e+06) are not recognized as int64 by the Arrow CSV reader.

The issue only shows up when writing because open_dataset is lazy: the data is not actually read until write_dataset runs.

A workaround would be to pass a schema when opening the dataset, or to cast the column to float64:

# Get the automatically inferred schema
csv_schema <- Sales_data$schema
# Change col 4 to float64()
csv_schema$SBPHYP <- float64()

# Cast to float64
Sales_data <- Sales_data$cast(target_schema = csv_schema)

You should then be able to cast back to an integer type if you require it.
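To illustrate the "pass a schema when opening" variant together with the cast back to integer, here is a minimal, self-contained sketch. The toy CSV, the `csv_dir` folder, and `demo_schema` are made up for the example and are not the asker's data. It assumes that `open_dataset()` accepts `schema` together with `skip = 1` for CSV files with a header row (some arrow versions may spell that option `skip_rows`), and that dplyr is installed.

```r
library(arrow)
library(dplyr)

# Hypothetical toy data: one SBPHYP value is stored in scientific notation.
csv_dir <- file.path(tempdir(), "sales_csv_demo")
dir.create(csv_dir, showWarnings = FALSE)
writeLines(
  c("Entity,SBPHYP,Qty", "1,201901,2.5", "1,5e+06,1.0", "2,201902,4.0"),
  file.path(csv_dir, "part-0.csv")
)

# Declare SBPHYP as float64 so that '5e+06' can be parsed.
demo_schema <- schema(Entity = int64(), SBPHYP = float64(), Qty = float64())

# Assumption: because a schema is supplied, the header row is skipped
# explicitly so that it is not parsed as data.
sales <- open_dataset(csv_dir, format = "csv", schema = demo_schema, skip = 1)

# Cast back to an integer type once the values have been parsed as doubles.
result <- sales %>%
  mutate(SBPHYP = as.integer(SBPHYP)) %>%
  collect()
```

On the real data, the same pattern would mean taking the inferred schema, setting SBPHYP to float64(), re-opening "Sales_Arrow" with that schema, and applying the integer cast before calling write_dataset.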

assignUser
  • Thank you. One more question. Is there any replacement for the ```slice``` function for ```arrow table```? – grislepak Oct 18 '22 at 21:11
  • @grislepak arrow tables have a [slice method](https://arrow.apache.org/docs/dev/r/reference/Table.html#r6-methods) – assignUser Oct 19 '22 at 11:43