
So I have this TSV dataset of 19,150,868 rows; I know for sure the number is correct because (a) it was specified by the owner of the file and (b) I checked it with wc -l on UNIX.

Yet, when I ran:

MyData <- read.table("dataset.tsv", header = FALSE, sep = "\t",
                     col.names = c_names, colClasses = "character",
                     comment.char = "", quote = "", nrows = 19150868)

only the first 835,873 rows were imported. No error is thrown, and the process takes only 20.33 seconds.
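
For anyone hitting the same symptom, a quick diagnostic is to look at the raw bytes around the point where the read stops; a stray control character (e.g. Ctrl-Z / 0x1a, or an embedded NUL) or a ragged row will usually show up there. This is only a sketch: the file name and line offsets below are simply the ones from this question.

    # Check whether every row has the expected number of tab-separated fields.
    table(count.fields("dataset.tsv", sep = "\t", quote = "", comment.char = ""))

    # Inspect the lines just after the last one that was read, as raw bytes,
    # so control characters (1a = Ctrl-Z/SUB, 00 = NUL) stand out.
    suspect <- readLines("dataset.tsv", n = 835876)
    lapply(suspect[835872:835876], charToRaw)

If count.fields itself stops early at the same row, that points at an embedded control character rather than a column-count problem.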

  • Is there anything strange about that particular row? Do the read-in lines in R look as they should? – Thomas Oct 20 '13 at 21:18
  • @Thomas: The last line fetched is `835873 user_000033 2007-05-24T19:50:25Z ~8+ ŤÄ`. – Edgar Derby Oct 20 '13 at 21:24
  • I'm assuming that is not what it's supposed to look like? If that's the case, play around with the `fileEncoding` parameter in `read.table`. – Thomas Oct 20 '13 at 21:27
  • @Thomas: I tried different encodings but none of them works... :( – Edgar Derby Oct 20 '13 at 21:30
  • 1
    You probably have an embedded Ctrl-Z which is causing R to abort the read prematurely. The easiest thing to do may be to edit the file to remove it. – Hong Ooi Oct 20 '13 at 21:50
  • Try creating a file with just the last line successfully read and then the next few lines after that, and try to read that. – mrip Oct 20 '13 at 23:03
  • Did you check the encoding of the file? – wush978 Oct 21 '13 at 07:42
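
Following up on Hong Ooi's suggestion above, one way to test the Ctrl-Z theory is to strip the SUB byte (0x1a) and re-read a cleaned copy. The snippet below is only a sketch: it loads the entire file into memory, so for a file this size a streaming tool (e.g. tr -d '\032' in a shell) may be more practical.

    # Sketch: remove embedded Ctrl-Z (SUB, 0x1a) bytes and write a cleaned copy.
    # Note: this reads the whole file into memory at once.
    raw_bytes <- readBin("dataset.tsv", what = "raw",
                         n = file.info("dataset.tsv")$size)
    clean <- raw_bytes[raw_bytes != as.raw(0x1a)]
    writeBin(clean, "dataset_clean.tsv")

    # Re-run the original call against the cleaned copy.
    MyData <- read.table("dataset_clean.tsv", header = FALSE, sep = "\t",
                         col.names = c_names, colClasses = "character",
                         comment.char = "", quote = "", nrows = 19150868)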

0 Answers