
The ?read.table help page states:

The number of data columns is determined by looking at the first five lines of input
(or the whole file if it has less than five lines), or from the length of col.names
if it is specified and is longer. This could conceivably be wrong if fill or
blank.lines.skip are true, so specify col.names if necessary (as in the ‘Examples’).

I need to use the fill parameter, and some of my txt files may have the row with the highest number of columns after the 5th row. I can't use a header (I simply don't have one), and the col.names will be defined after the import, so I would like to change those 5 rows that R inspects into the whole file (I don't mind any speed loss this could cause). Any suggestion? Thanks!
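
To make the problem concrete, here is a minimal sketch (the file name and contents are made up) where the widest row only shows up on line 6:

tmp <- tempfile(fileext = ".txt")
writeLines(c("a,b", "a,b", "a,b", "a,b", "a,b", "a,b,c,d"), tmp)
# read.table infers 2 columns from the first 5 lines, so with fill = TRUE
# the 4 fields of line 6 are not turned into columns 3 and 4
read.table(tmp, sep = ",", fill = TRUE)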

EDIT:

I just found this in the source of read.table:

if (skip > 0L) 
    readLines(file, skip)
nlines <- n0lines <- if (nrows < 0L) 
    5    # <- the hard-coded number of lines inspected
else min(5L, (header + nrows))
lines <- .External(C_readtablehead, file, nlines, comment.char, 
    blank.lines.skip, quote, sep)
nlines <- length(lines)

Can I just change the number 5 in the 4th line of the code above? Is that going to have any side effect on read.table's behaviour?
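
A less invasive route than editing the installed source would be to patch a private copy. A rough, untested sketch (the file name my_read_table.R is just a placeholder):

dump("read.table", file = "my_read_table.R", envir = asNamespace("utils"))
# ... manually edit my_read_table.R, replacing the hard-coded 5 ...
source("my_read_table.R")  # defines the patched copy in the global environment
# point it back at the utils namespace so internals like C_readtablehead resolve
environment(read.table) <- asNamespace("utils")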

EDIT 2:

I'm currently using this method

maxCol <- max(sapply(readLines(filesPath), function(x) length(strsplit(x, ",")[[1]])))

to get the max number of columns, and I use the result to create dummy col.names like paste0("V", seq_len(maxCol)). Do you think it is still worth having another read.table with the possibility to choose that?
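
Putting it together, the import currently looks roughly like this (a sketch; filesPath is assumed to be a single comma-separated file):

maxCol <- max(sapply(readLines(filesPath),
                     function(x) length(strsplit(x, ",")[[1]])))
dat <- read.table(filesPath, sep = ",", fill = TRUE,
                  col.names = paste0("V", seq_len(maxCol)))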

Michele
  • What about if you specify which columns to read via `colClasses`? – Roman Luštrik May 16 '13 at 10:43
  • Hi, the thing is: I don't know in advance the number of columns (got many files to import, with different numbers of columns per row, so I use `fill`). I could run a `scan` and check the `max(col)` before calling `read.table`, but it'd be more consistent (to me) to have the possibility to choose the number of lines to scan (in place of a hard-coded `5`) – Michele May 16 '13 at 11:05
  • Can you provide more details on how and why your source files vary in format so much? Personally, I'd shoot the dope who created them, but that's occasionally frowned upon. And why do the row lengths vary so much? If it's a matter of a lot of header/metadata rows prior to a block of data, then consider using the `skip` argument. Otherwise, maybe just run `readLines` to load the entire file and then parse it out. – Carl Witthoft May 16 '13 at 11:41
  • Just in case someone's reading this, I'm upvoting Carl's comment, but not the shooting part, the `readLines` and parsing part. – Roman Luštrik May 16 '13 at 12:04
  • @CarlWitthoft I haven't created such files, but I have to work with them. – Michele May 16 '13 at 15:19

1 Answer


Use count.fields, e.g.,

read.table(filesPath, colClasses=rep(NA, max(count.fields(filesPath))), fill=TRUE)
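
For the comma-separated files from the question, you would pass the separator to both calls; a sketch with the same assumed filesPath:

read.table(filesPath, sep = ",", fill = TRUE,
           colClasses = rep(NA, max(count.fields(filesPath, sep = ","))))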
Matthew Plourde
  • Thanks! Once again, a useful function I'd never seen is posted on SO. – Carl Witthoft May 16 '13 at 12:47
  • Hi ya! Thanks, good answer! I'd improved my EDIT 2 solution in the meantime to `max(unlist(lapply(gregexpr(",", readLines(files[j])), length)))`, but yours is far more elegant and compact. It's also a bit faster, though I don't mind speed so much for this task, since I've got (lots of) very small files. – Michele May 16 '13 at 15:28
  • One question: using `colClasses` I guess I'll bypass the code that tries to assign classes in `read.table`. Setting the classes to `NA`, will the output be all character columns? Thanks. – Michele May 16 '13 at 15:38
  • No, `NA` tells `R` to use `type.convert` to determine the datatype. – Matthew Plourde May 16 '13 at 15:39
  • I see. So, since `colClasses` is not `NULL`, you are telling `R` the number of columns you want, but since you are not actually supplying any data type, it'll run `type.convert`. – Michele May 16 '13 at 15:47
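
A quick sketch showing that difference (using read.table's text argument; the sample data is made up):

txt <- "1,a,TRUE"
# NA: type.convert guesses each column's class
str(read.table(text = txt, sep = ",", colClasses = rep(NA, 3)))
# "character": every column is forced to character
str(read.table(text = txt, sep = ",", colClasses = rep("character", 3)))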