
In R, I'm trying to read in a basic CSV file of about 42,900 rows (confirmed by Unix's wc -l). The relevant code is

vecs <- read.csv("feature_vectors.txt", header=FALSE, nrows=50000)

where nrows is a slight overestimate because why not. However,

> dim(vecs)
[1] 16853     5

indicating that the resultant data frame has on the order of 17,000 rows. Is this a memory issue? Each row consists of a ~30 character hash code, a ~30 character string, and 3 integers, so the total size of the file is only about 4MB.

If it's relevant, I should also note that a lot of the rows have missing fields.

Thanks for your help!

Cardano
  • Have you looked to see whether the rows that *were* imported were imported correctly? – blahdiblah Jul 03 '12 at 23:02
  • My guess is that you have embedded unmatched `"`. So some of your rows are actually much longer than they should be. I'd do something like `apply(vecs, 2, function(x) max(nchar(as.character(x))))` to check. – Justin Jul 03 '12 at 23:08
  • Yup! Justin got it. Adding `quote=""` fixed the problem. – Cardano Jul 03 '12 at 23:16
  • @Justin, please put this as an answer so Cardano can accept it as the correct solution to his problem. :) – Roman Luštrik Jul 04 '12 at 09:47
  • For the record, it's also the case that `read.table` can do some really screwy things if `fill=TRUE` and rows *after* the first five have more fields than any in the first five ... this is referred to obliquely in the help file ... – Ben Bolker Jul 04 '12 at 14:45
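
If you do end up needing fill = TRUE, the workaround the help file hints at is to pass col.names explicitly, so that the number of columns is not guessed from the first five lines. A minimal sketch for this file, assuming the expected 5 columns (the column names here are made up):

# Fixing col.names up front stops rows after the first five from changing
# how many columns read.table allocates
read.table("feature_vectors.txt", sep = ",", header = FALSE,
           fill = TRUE, col.names = paste0("V", 1:5))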

2 Answers


This sort of problem is often easy to diagnose with count.fields, which counts the number of fields on each line of the file, i.e. how many columns read.csv would try to create from that line. For a CSV, remember to pass sep = ",", since count.fields defaults to whitespace as the separator.

(n_fields <- count.fields("feature_vectors.txt", sep = ","))

If not all the values of n_fields are the same, you have a problem.

if(any(diff(n_fields) != 0))
{
  warning("There's a problem with the file")
}

In that case, look at the values of n_fields that differ from what you expect: the problems occur in those rows.
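
For example, a quick way to locate them (assuming the expected count is 5 fields per line, as the question's dim() output suggests):

table(n_fields)       # how many lines have each field count
which(n_fields != 5)  # line numbers of the suspect rows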

As Justin mentioned, a common problem is unmatched quotes. Open your CSV file and find out how strings are quoted there. Then call read.csv, specifying the quote argument.
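
For the file in this question, turning quoting off entirely is what fixed it (see the comments above):

# quote = "" makes read.csv treat stray " characters as ordinary text
# instead of string delimiters that can swallow several lines into one field
vecs <- read.csv("feature_vectors.txt", header = FALSE, nrows = 50000, quote = "")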

Richie Cotton

My guess is that you have embedded unmatched ". So some of your rows are actually much longer than they should be. I'd do something like apply(vecs, 2, function(x) max(nchar(as.character(x)))) to check.
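
A runnable per-column variant of the same check (sapply rather than apply, and na.rm = TRUE because the question mentions many missing fields); any column whose maximum is much longer than the ~30 characters described in the question points at lines that were merged by an unmatched quote:

sapply(vecs, function(x) max(nchar(as.character(x)), na.rm = TRUE))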

Justin