
A little mystery. I have a .tsv file that contains 58936 rows. I loaded the file into R using this command:

dat <- read.csv("weekly_devdata.tsv", header=FALSE, stringsAsFactors=TRUE, sep="\t")

but nrow(dat) only shows this:

> nrow(dat)
[1] 28485

So I used sed -n to write the rows around where it stopped (the row before, the row itself, and the row after) to a new file, and that file loaded into R without trouble, so I don't think there was any corruption in the file.
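For reference, the same spot check can be done directly in R; this is just a sketch, and the line numbers are illustrative:

  # Pull the raw lines around the row where loading appeared to stop
  raw <- readLines("weekly_devdata.tsv")
  raw[28483:28487]  # a window around row 28485; adjust as needed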

Is it an environment issue?

Here's my sessionInfo()

> sessionInfo()
R version 3.1.2 (2014-10-31)
Platform: x86_64-apple-darwin13.4.0 (64-bit)

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] tcltk     stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] sqldf_0.4-10   RSQLite_1.0.0  DBI_0.3.1      gsubfn_0.6-6   proto_0.3-10   scales_0.2.4   plotrix_3.5-11
[8] reshape2_1.4.1 dplyr_0.4.1   

loaded via a namespace (and not attached):
 [1] assertthat_0.1   chron_2.3-45     colorspace_1.2-4 lazyeval_0.1.10  magrittr_1.5     munsell_0.4.2   
 [7] parallel_3.1.2   plyr_1.8.1       Rcpp_0.11.4      rpart_4.1-8      stringr_0.6.2    tools_3.1.2 

Did I run out of memory? Is that why it didn't finish loading?

Jazzmine

3 Answers


I had a similar problem lately, and it turned out I had two different problems.

1 - Not all rows had the right number of tabs. I ended up counting them with awk.

2 - At some points in the file there were quotes that were never closed, which caused the parser to skip over all the following lines until it found a closing quote.

I will dig up the awk code I used to investigate and fix these issues and post it.

Since I am using Windows, I used the awk that came with git bash.

This counted the number of tabs in a line and printed out those lines that did not have the right number.

  awk -F "\t" 'NF != 6 { print NF-1 ":" $0 }' Catalog01.csv

I used something similar to count quotes, and I used tr to fix a lot of it.
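For instance, a rough R equivalent of that quote check might look like this (just a sketch, assuming plain double quotes and a file that fits in memory):

  # Flag lines with an odd (i.e. unmatched) number of double quotes
  lines  <- readLines("Catalog01.csv")
  counts <- sapply(gregexpr('"', lines, fixed = TRUE), function(m) sum(m > 0))
  which(counts %% 2 == 1)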

Mike Wise
  • Hi Mike: I'm going to try that now. But since I was able to read the rows around, and including, that row, I'm not sure that's it. Thanks – Jazzmine Feb 14 '15 at 16:08
  • I bet missing quotes is part of the problem. That is usually the case when so many rows are missing. But maybe not... – Mike Wise Feb 14 '15 at 16:32
  • Ok, I assumed it read the first x rows and then stopped, but maybe it actually made it to the end of the file and just rejected some of the rows along the way. Is there a way to tell? – Jazzmine Feb 14 '15 at 16:38
  • With awk (or sed I suppose) you could easily add a running number as the first field, and then you could see what rows made it in. – Mike Wise Feb 14 '15 at 16:40
  • Thanks for the awk command. It worked great: it showed I had 86 rows with a different number of columns. I suspect there were other problems this command didn't catch, but at least I now know it's most likely a set of data-quality issues. I also added a ; wc -l at the end of the awk command to show the number of failing rows. – Jazzmine Feb 16 '15 at 04:04
  • I fixed the awk so that it actually counted tabs and not commas :) – Mike Wise Feb 16 '15 at 12:23

Pretty sure this was not a memory issue. If the problem is unmatched quotes, then try this:

dat <- read.csv("weekly_devdata.tsv", header=FALSE, stringsAsFactors=TRUE, sep="\t",
                quote="")

There is also the very useful function count.fields, which I wrap in table to get a high-level view of the consequences of various parameter settings. Take a look at the results of:

table(count.fields("weekly_devdata.tsv", sep="\t"))

And compare to:

table(count.fields("weekly_devdata.tsv", sep="\t", quote=""))

It's sometimes necessary to read the file in with readLines, remove the offending line or lines, assign the result to clean, and then send the cleaned-up lines to read.table(text=clean, sep="\t", quote="").
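A minimal sketch of that workflow (the expected column count of 6 here is illustrative, borrowed from the awk answer above):

  # Read raw lines, keep only rows with the expected field count, then parse
  raw   <- readLines("weekly_devdata.tsv")
  nf    <- count.fields(textConnection(raw), sep = "\t", quote = "")
  clean <- raw[nf == 6]  # assumes no blank lines; count.fields skips them by default
  dat   <- read.table(text = clean, sep = "\t", quote = "", stringsAsFactors = TRUE)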

IRTFM

Could well be some illegal characters in some of the entries... Check how many rows actually load and where the failure happens, then delve deeper into that row of the raw data. Webdings font, that kind of stuff!
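If you want to hunt for such characters, one option is the helper in the tools package (a sketch; it flags non-ASCII bytes, which is only a rough proxy for "illegal"):

  # Print every line of the raw file that contains non-ASCII characters
  tools::showNonASCIIfile("weekly_devdata.tsv")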

RjD
  • Welcome to SO. Please explain how the OP might go about doing some of your suggestions. For example, it might be good to explain how to check how many uploads are going on. – Sumner Evans Feb 14 '15 at 22:16