
In R, I'm trying to read in a basic CSV file of about 42,900 rows (confirmed by Unix's wc -l). The relevant code is

vecs <- read.csv("feature_vectors.txt", header=FALSE, nrows=50000)

where nrows is a slight overestimate because why not. However,

> dim(vecs)
[1] 16853     5

indicating that the resultant data frame has on the order of 17,000 rows. Is this a memory issue? Each row consists of a ~30 character hash code, a ~30 character string, and 3 integers, so the total size of the file is only about 4MB.

If it's relevant, I should also note that a lot of the rows have missing fields.

Thanks for your help!

Cardano
  • Have you looked to see whether the rows that *were* imported were imported correctly? – blahdiblah Jul 03 '12 at 23:02
  • My guess is that you have embedded unmatched `"`. So some of your rows are actually much longer than they should be. I'd do something like `apply(vecs, 2, function(x) max(nchar(as.character(x))))` to check. – Justin Jul 03 '12 at 23:08
  • Yup! Justin got it. Adding `quote=""` fixed the problem. – Cardano Jul 03 '12 at 23:16
  • @Justin, please put this as an answer so Cardano can accept it as the correct solution to his problem. :) – Roman Luštrik Jul 04 '12 at 09:47
  • For the record, it's also the case that `read.table` can do some really screwy things if `fill=TRUE` and rows *after* the first five have more fields than any in the first five ... this is referred to obliquely in the help file ... – Ben Bolker Jul 04 '12 at 14:45
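
If you do end up needing fill = TRUE, the workaround the help file hints at is to pass col.names explicitly, so that the number of columns is not guessed from the first five lines. A minimal sketch for this file, assuming the expected 5 columns (the column names here are made up):

# Fixing col.names up front stops rows after the first five from changing
# how many columns read.table allocates
read.table("feature_vectors.txt", sep = ",", header = FALSE,
           fill = TRUE, col.names = paste0("V", 1:5))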

2 Answers


This sort of problem is often easy to diagnose with count.fields, which counts the number of fields on each line of the file, i.e. how many columns read.csv would try to create from that line. For a CSV, remember to pass sep = ",", since count.fields defaults to whitespace as the separator.

(n_fields <- count.fields("feature_vectors.txt", sep = ","))

If not all the values of n_fields are the same, you have a problem.

if(any(diff(n_fields) != 0))
{
  warning("There's a problem with the file")
}

In that case, look at the values of n_fields that differ from what you expect: the problems occur in those rows.
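
For example, a quick way to locate them (assuming the expected count is 5 fields per line, as the question's dim() output suggests):

table(n_fields)       # how many lines have each field count
which(n_fields != 5)  # line numbers of the suspect rows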

As Justin mentioned, a common problem is unmatched quotes. Open your CSV file and find out how strings are quoted there. Then call read.csv, specifying the quote argument.
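
For the file in this question, turning quoting off entirely is what fixed it (see the comments above):

# quote = "" makes read.csv treat stray " characters as ordinary text
# instead of string delimiters that can swallow several lines into one field
vecs <- read.csv("feature_vectors.txt", header = FALSE, nrows = 50000, quote = "")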

Richie Cotton

My guess is that you have embedded unmatched ". So some of your rows are actually much longer than they should be. I'd do something like apply(vecs, 2, function(x) max(nchar(as.character(x)))) to check.
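
A runnable per-column variant of the same check (sapply rather than apply, and na.rm = TRUE because the question mentions many missing fields); any column whose maximum is much longer than the ~30 characters described in the question points at lines that were merged by an unmatched quote:

sapply(vecs, function(x) max(nchar(as.character(x)), na.rm = TRUE))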

Justin