
I have an input CSV file of the following form:

id,No.,V,S,D
1,0100000109,623,233,331
2,0200000109,515,413,314
3,0600000109,611,266,662

I need to read the No. column as it is (i.e., as character, so the leading zeros are kept). I know I can use something like this for that:

data <- read.csv("input.csv", colClasses = c("No." = "character"))

This is the code I'm using to read the CSV file in chunks:

chunk_size <- 2
con <- file("input.csv", open = "r")

# First chunk: read the header line and force the No. column to character
data_frame <- read.csv(con, nrows = chunk_size, colClasses = c("No." = "character"), quote = "", header = TRUE)
header <- names(data_frame)
print(header)
print(data_frame)

if (nrow(data_frame) == chunk_size) {
  repeat {
    # Remaining chunks: no header line, so reuse the saved column names
    data_frame <- read.csv(con, nrows = chunk_size, header = FALSE, quote = "")
    names(data_frame) <- header
    print(header)
    print(data_frame)
    if (nrow(data_frame) < chunk_size) {
      break
    }
  }
}

close(con)

The issue I'm facing is that only the first chunk reads the No. column as character; the remaining chunks do not.

How can I resolve this?

PS: The original input file has more than 150 columns and about 20 million rows.

Raymond

2 Answers


You also need to pass colClasses to the read.csv() call inside the repeat loop. Those calls no longer read the header, so colClasses cannot be matched by name; define an unnamed vector with one entry per column instead. Let's say the file has 150 columns:

myColClasses <- rep("numeric", 150)
myColClasses[2] <- "character"
repeat {
  data_frame <- read.csv(con, nrows = chunk_size, colClasses = myColClasses, header = FALSE, quote = "")
  ...
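
With many columns, the class vector does not have to be typed by hand. The following is only a sketch, assuming a plain comma-separated header without quoted fields: it reads the header line once, uses NA for every column so that read.csv guesses the type, and forces only the requested columns to character. The names read_chunks, char_cols and handle_chunk are made up for this example.

```r
# Hypothetical helper: read a CSV in chunks, forcing selected columns to character.
read_chunks <- function(path, char_cols, chunk_size, handle_chunk) {
  con <- file(path, open = "r")
  on.exit(close(con))

  # Read the header line ourselves so every chunk can use header = FALSE.
  # Assumption: the header is plain comma-separated text without quotes.
  header <- strsplit(readLines(con, n = 1), ",", fixed = TRUE)[[1]]

  # NA means "let read.csv guess"; only the named columns are forced.
  classes <- rep(NA_character_, length(header))
  classes[header %in% char_cols] <- "character"

  repeat {
    chunk <- tryCatch(
      read.csv(con, nrows = chunk_size, header = FALSE, quote = "",
               col.names = header, colClasses = classes,
               check.names = FALSE),
      error = function(e) NULL  # treat a failed read as end of input
    )
    # Stop when nothing is left (failed read or zero rows).
    if (is.null(chunk) || nrow(chunk) == 0) break
    handle_chunk(chunk)
    # A short chunk means the end of the file was reached.
    if (nrow(chunk) < chunk_size) break
  }
  invisible(NULL)
}
```

For the sample file, read_chunks("input.csv", "No.", 2, print) should print each chunk with No. kept as character. Because the same class vector is passed on every call, the first chunk and the later chunks are read consistently.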

phileas
  • The input I've provided is just a sample. The original file contains more than 150 columns, so your solution would be very difficult to apply. Is there an alternative? – Raymond Feb 10 '17 at 10:25

You can read the data as strings with readLines and split each line on the commas:

fileName <- "input.csv"
df <- do.call(rbind.data.frame, strsplit(readLines(fileName), ",")[-1]) # skip the header line
colnames(df) <- c("id","No.","V","S","D") # add the column names

Or use the direct approach with read.csv:

fileName <- "input.csv"
col <- c("integer","character","integer","integer","integer")
df <- read.csv(file = fileName,
               sep = ",", 
               colClasses=col, 
               header = TRUE, 
               stringsAsFactors = FALSE)
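
The column names in the readLines snippet do not have to be typed out either. As a rough sketch, assuming a simple comma-separated file with no quoted fields and at least one data row, the names can be taken from the first line. The function name read_all_as_character is invented for this example, and note that it loads the whole file into memory, which may not be practical for very large inputs.

```r
# Hypothetical helper: read every column as character, taking names from line 1.
read_all_as_character <- function(path) {
  lines <- readLines(path)

  # The first line holds the column names.
  header <- strsplit(lines[1], ",", fixed = TRUE)[[1]]

  # Split the remaining lines and bind them into a character matrix.
  fields <- strsplit(lines[-1], ",", fixed = TRUE)
  df <- as.data.frame(do.call(rbind, fields), stringsAsFactors = FALSE)

  colnames(df) <- header
  df
}
```

Every column comes back as character, so No. keeps its leading zeros. The other columns can be converted afterwards if needed, for example with type.convert(df$V, as.is = TRUE).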
holzben
  • As I've already mentioned in the question, the input file contains more than 150 columns, and manually specifying the data type for every column is very difficult. – Raymond Feb 13 '17 at 07:13
  • In my first code snippet you don't need to do that. Setting the column names (third line) can be done automatically as well, e.g. by using readLines. – holzben Feb 13 '17 at 07:25