Integer columns with missing values change to object dtype in Python when importing an RData file with the pyreadr package

Question

I want to run some Python functions on data stored in an '.RData' file. I am using the 'pyreadr' Python package to read it.

Here is an example of the R code:

library(data.table)

# Example 
data <- data.table(x_num=c(1,1.5,2),
                   x_int=c(1,2,3))
data$x_int <- as.integer(data$x_int) # Making sure the data is in integer type


data_missing <- data.table(x_num=c(1.5,2,NA,5,6),
                   x_int=c(1,2,3,NA,5))
data_missing$x_int <- as.integer(data_missing$x_int) # Making sure the data is in integer type

# checking the classes
sapply(data,class)
sapply(data_missing,class)

# Storing the data in RData file 
save(data, file = "test_data.RData")
save(data_missing, file = "test_missing_data.RData")

I store the two tables in separate files because 'test_data.RData' loads correctly in Python, whereas loading 'test_missing_data.RData' turns the integer column that contains NA values into object dtype instead of an integer dtype.

Here is the Python code:

# Working example
import pyreadr
result=pyreadr.read_r('test_data.RData')
data=result['data']
data.dtypes
# Output
# x_num    float64
# x_int      int32

# Example where NA values are converted to object datatype
import pyreadr
result=pyreadr.read_r('test_missing_data.RData') # No error is raised, but x_int is read as object

data=result['data_missing']
data.dtypes
# Output
# x_num    float64
# x_int     object

No error message is raised; however, I need the column to keep an integer datatype even when it contains missing (NA) values.

Thank you for your time and help.

score 0 · Answer 1 · answered Aug 09 '22 at 13:53

At the moment, what you describe is the expected behavior of the package. In older versions of pandas, integer columns were backed by NumPy integer arrays, which cannot hold NaN: NaN is a float, and it was the only available representation of a missing value. The column type therefore had to be set to object so that it could hold data of two different types, integer and float.

More recently, pandas has introduced a nullable integer column type.

Pyreadr will convert those object columns back to an R integer when writing back to R.
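If an integer dtype is needed on the Python side after reading, one option is to cast the object column to the nullable integer type yourself. The following is a minimal sketch, not part of pyreadr: the DataFrame below is a hand-built stand-in for what `pyreadr.read_r` returns for 'test_missing_data.RData', and it assumes the missing entries arrive as NaN.

```python
import numpy as np
import pandas as pd

# Stand-in for result['data_missing'] (assumption: the R NA in x_int
# arrives as NaN inside an object column).
data_missing = pd.DataFrame({
    "x_num": [1.5, 2.0, np.nan, 5.0, 6.0],
    "x_int": pd.Series([1, 2, 3, np.nan, 5], dtype="object"),
})

# Parse the object column as numbers (this yields float64 because of NaN),
# then cast to the nullable 32-bit integer dtype. Note the capital "I".
data_missing["x_int"] = pd.to_numeric(data_missing["x_int"]).astype("Int32")

print(data_missing.dtypes)
```

After the cast, the missing entry is stored as `pd.NA` and the remaining values stay integers.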

When writing integers to R, make sure they are 32-bit integers or smaller. In R all integers are 32-bit, whereas pandas supports 64-, 32-, 16- and 8-bit integers. 64-bit integers cannot be translated to 32-bit integers because of the risk of overflow. If you create your own integer columns, it is best to convert them to the type 'Int32' (note the capital I), and pyreadr will then convert them correctly to R integers.
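The overflow concern can be checked before casting. The sketch below uses a hypothetical helper, `to_r_integer` (not a pyreadr function), that refuses to downcast a column whose values do not fit in R's 32-bit integer range. It relies on the fact that R reserves the lowest 32-bit value for `NA_integer_`.

```python
import numpy as np
import pandas as pd

# R integers are 32-bit, and the lowest value is reserved for NA_integer_.
R_INT_MAX = np.iinfo(np.int32).max
R_INT_MIN = np.iinfo(np.int32).min + 1


def to_r_integer(col: pd.Series) -> pd.Series:
    """Hypothetical helper: cast an integer column to nullable Int32 if it fits."""
    present = col.dropna()
    if not present.empty and (present.min() < R_INT_MIN or present.max() > R_INT_MAX):
        raise OverflowError(f"column {col.name!r} does not fit in a 32-bit R integer")
    return col.astype("Int32")


df = pd.DataFrame({
    "small": pd.Series([1, 2, None], dtype="Int64"),
    "big": pd.Series([1, 2**40, None], dtype="Int64"),
})

small = to_r_integer(df["small"])  # fits, becomes Int32
# to_r_integer(df["big"]) would raise OverflowError because 2**40 is too large
```

A frame whose integer columns have been cast this way could then be passed to `pyreadr.write_rdata`.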
