I am learning Apache Arrow for R and ran into the following issue. My dataset has 85+ million rows, which makes Arrow really useful. I do the following very simple steps.
- Open the existing dataset as an Arrow dataset
Sales_data <- open_dataset("Sales_Arrow", format = "csv")
Sales_data
The result is:
FileSystemDataset with 4 csv files
SBLOC: string
Cust-Loc: string
Cust-Item-Loc: string
SBPHYP: int64
SBINV: int64
Cust(Child)Entity: string
SBCUST: string
SBITEM: string
SBTYPE: string
Qty: double
SBPRIC: double
SBICST: double
Unit_Cost_Net: double
SBINDT: date32[day]
SASHIP: string
Entity: int64
ParentCustID: string
ParentCustName: string
Customer-ShipID-Loc: string
Pred_Entity_Loc: string
Cust(Child)-Entity: string
Item-Entity: string
- Right after that, I write the dataset to disk as a partitioned dataset
write_dataset(Sales_data, "Sales All Partitioned", partitioning = c("Entity", "SBPHYP"))
and get the following error:
Error: Invalid: In CSV column #4: Row #444155: CSV conversion error to int64: invalid value '5e+06'
I checked the value in Sales_data[444155, 4]. It is 201901, exactly the same as in the several rows before and after it.
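A minimal sketch of how the raw CSV files could be checked directly, rather than the Arrow dataset. It assumes the row number in the error counts rows within one of the four CSV files and not within the combined dataset. `find_non_integer_values` is a hypothetical helper written for illustration, the demo data is made up, and only base R is used.

```r
# Hypothetical helper: return the rows of one CSV file whose value in
# `column` is not a plain integer literal such as 201901.
# Empty and missing values are flagged as well.
find_non_integer_values <- function(path, column) {
  # Read every column as text so that nothing is converted or reformatted.
  raw <- read.csv(path, colClasses = "character", check.names = FALSE)
  bad <- !grepl("^-?[0-9]+$", raw[[column]])
  data.frame(
    file = rep(basename(path), sum(bad)),
    row = which(bad),              # data row number, header not counted
    value = raw[[column]][bad],
    stringsAsFactors = FALSE
  )
}

# Small self-contained demonstration with made-up data.
demo_file <- tempfile(fileext = ".csv")
writeLines(c("SBLOC,SBPHYP", "A,201901", "B,5e+06", "C,201902"), demo_file)
demo_result <- find_non_integer_values(demo_file, "SBPHYP")
print(demo_result)
```

On the real data, the helper could be applied to each file returned by list.files("Sales_Arrow", full.names = TRUE). Reading tens of millions of rows with read.csv is slow and memory-hungry, so this is only meant as a diagnostic. The column number in the error may also be zero-based, in which case "column #4" would refer to SBINV rather than SBPHYP; that is an assumption, so checking both columns seems worthwhile.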
Please help me understand what is going on and how to fix this issue.