2

I'm using R to tidy data supplied to me (in a SAS file) so that I can bulk insert it into a SQLserver database. The problem that I'm having is that sometimes numeric fields get transformed by R after I read them in eg.(the leading 0 gets dropped, some numeric fields convert to scientific notation, long ID numbers turn into gibberish after the 15th digit).

Reading all the data into R as character solves these issues. When I'm supplied a csv file I can use data.tables 'fread' function to specify colClasses = 'character' however as far as I'm aware something like this doesnt exist for the 'read_sas' function from the haven package.

Are there any workarounds or extra documentation on how I can better approach and solve this issue?


Edit to highlight issues (left values is numeric and what I want to avoid, right value is as character and what I want):

1.

postcode <- c(0629,'0629') postcode [1] "629" "0629"

2.

id <- c(12000000,'12000000') id [1] "1.2e+07" "12000000"

3.

options(scipen=999) id <- c(123123123123123123123123,'123123123123123123123123') id [1] "123123123123123117883392" "123123123123123123123123"

How can I import the data directly from SAS so that all columns in the data frame are read in as character data type (in order to avoid data quality issues when I insert into SQLserver)

Maharero
  • 238
  • 1
  • 10
  • What do you mean about dropping leading zeros from numeric fields? Leading zeros cannot exist on a numeric value. There is no difference between the number `00123` and the number `123`. A SAS dataset stores numbers as floating point, so the maximum precision is about 15 digits. Sounds like perhaps your issue is with the creation of the fields as numbers in the SAS dataset, not in the R code. – Tom Mar 15 '18 at 04:06
  • Hey Tom, just as an example with the leading zeros is a column containing post codes - a post code like `0629` gets converted into `629` by R and then gets displayed like this in SQL. The SAS files get supplied to me from other agencies so I'm not really sure what format everything is in - the reason I'm using R to view these file is that our R servers open and view the datasets in 15m whereas it takes several hours in SAS just to view it. Our team is also much more comfortable with R compared to SAS. – Maharero Mar 15 '18 at 19:08
  • 1
    If you want `0629` to be different from `629` you need to store it as a character variable. But if it is a character variable in SAS it should become a character variable in R. So your problem is the SAS file. Perhaps they are storing the post codes as numbers and applying the Z4. format to have them print with leading zeros? In that case you would need to similar transform after reading the dataset from SAS into R. – Tom Mar 15 '18 at 19:18
  • Cheers Tom - I thought that read_sas 'tries to guess' the class of each column when importing by looking at the data contained in a subset of each columns rows but if instead read_sas carries over the same data types in the SAS files then there is not much I can do on my end to solve the issues. If the source agency is supplying me the data where id's are floating point and not character then issue #3 will already have happened on their end - I'll let my supplier know of this. – Maharero Mar 15 '18 at 20:02

0 Answers0