
I am a total SQL ignoramus, so I apologize if this is very simple.

I have data with an ID column consisting of numbers, many of which have leading zeros. I would like to import the data using sqldf, but doing so drops the leading zeros. Is there a way to keep them, perhaps by specifying that all columns should be read as character, as with `colClasses` in R's `read.table`?

I can't share my data due to the nature of my work, but I am doing something like this:

a <- formatC(sample(1:99, 10), width = 8, format = "d", flag = "0")
fakeDF <- data.frame(v1=a, v2=rnorm(10, 0, 1))
f1 <- tempfile()
write.table(fakeDF, file=f1, quote=FALSE, row.names=FALSE, col.names=FALSE, sep="|")

f2 <- file(f1)
mydat <- sqldf::sqldf("SELECT * FROM f2", dbname=tempfile(), 
                      file.format=list(header=FALSE, sep="|", eol="\n", skip=1))
mydat

Also, the IDs are not all the same length. If possible, I would like to avoid having to manually pad the data with zeros after the fact.

statsNoob
  • In your *real* data, is the column (`v1`) stored on disk with leading zeros? – nrussell Dec 03 '15 at 18:31
  • @nrussell Yes, it is. – statsNoob Dec 03 '15 at 18:35
  • Honestly I would just read the data in normally with `read.table(..., colClasses = "character")` - or better yet, use `data.table::fread` - and then handle the data accordingly. You can always use `sqldf` to manipulate the objects that you read in. – nrussell Dec 03 '15 at 18:50
  • @nrussell I would like to do that, but the problem is the size of my data. fread is a great alternative (and is faster) but it throws an error that I cannot seem to rectify for a few files. I actually am using fread in some cases and sqldf in others but need to use both. I have around 20GB of data total spread out in different files. Unfortunately read.table is not even close to fast enough. – statsNoob Dec 03 '15 at 18:57
  • Would you mind posting the error that `fread` is throwing (and the function call that is generating it)? My only other suggestion would be, if possible, store your data (or a copy of it) in a SQLite file using e.g. `VARCHAR(12)` for columns that contain numbers with leading zeros. You could then use `sqldf` to read the data into R in the correct format. – nrussell Dec 03 '15 at 19:10
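The read-everything-as-character approach suggested in the comments can be sketched like this (a sketch only, assuming the pipe-delimited file `f1` from the question and that `data.table` is installed):

```r
library(data.table)

# Read every column as character so the leading zeros survive,
# then convert the genuinely numeric columns back afterwards.
mydat <- fread(f1, sep = "|", header = FALSE, colClasses = "character")
mydat[, V2 := as.numeric(V2)]  # V2 is the rnorm column in the fake data
```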

2 Answers


Use colClasses like this:

library(sqldf)

read.csv.sql(f1, header = FALSE, sep = "|", colClasses = c("character", "numeric"))

giving:

        V1         V2
1  00000029  1.7150650
2  00000078  0.4609162
3  00000040 -1.2650612
4  00000085 -0.6868529
5  00000090 -0.4456620
6  00000005  1.2240818
7  00000050  0.3598138
8  00000083  0.4007715
9  00000051  0.1106827
10 00000042 -0.5558411

Note: We used the input file generated using this random seed:

set.seed(123)

a <- formatC(sample(1:99, 10), width = 8, format = "d", flag = "0")
fakeDF <- data.frame(v1=a, v2=rnorm(10, 0, 1))
f1 <- tempfile()
write.table(fakeDF, file=f1, quote=FALSE, row.names=FALSE, col.names=FALSE, sep="|")
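A quick sanity check that the fix worked (a sketch using the fake data above): inspect the column classes of the result.

```r
library(sqldf)

res <- read.csv.sql(f1, header = FALSE, sep = "|",
                    colClasses = c("character", "numeric"))
# V1 should come back as character (zeros intact), V2 as numeric
sapply(res, class)
```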
G. Grothendieck

One way to restore leading zeros is with SQL string functions: concatenate a run of zeros longer than your longest ID onto the ID field, then keep only the rightmost characters at the width you need. The example below uses a width of 8 characters (note that `rightstr` is one of the RSQLite extension functions, which sqldf loads by default):

mydat <- sqldf::sqldf("select rightstr('0000000000000' || ID, 8) As LeadZeroID, 
                              * from f2;", 
                       dbname=tempfile(), 
                       file.format=list(header=FALSE, sep="|", eol="\n", skip=1))
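If the IDs have already come in as plain numbers, SQLite's built-in `printf` function offers a more direct way to zero-pad them (a sketch, not part of the original answer; it assumes an 8-digit width and that the ID column was read in as `V1`):

```r
library(sqldf)

# printf('%08d', ...) zero-pads the integer to 8 digits,
# restoring the leading zeros as text
padded <- sqldf("SELECT printf('%08d', V1) AS LeadZeroID, V2 FROM mydat")
```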
Parfait