
So I have this TSV dataset of 19,150,868 rows; I know for sure the number is correct because (a) it was specified by the owner of the file and (b) I checked it with wc -l on UNIX.

Yet, when I ran:

MyData <- read.table("dataset.tsv", header = FALSE, sep = "\t",
                     col.names = c_names, colClasses = "character",
                     comment.char = "", quote = "", nrows = 19150868)

only the first 835,873 rows were imported. No error is thrown, and the process takes only 20.33 seconds.
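
For anyone hitting the same symptom, a quick diagnostic is to look at the raw bytes around the point where the read stops; a stray control character (e.g. Ctrl-Z / 0x1a, or an embedded NUL) or a ragged row will usually show up there. This is only a sketch: the file name and line offsets below are simply the ones from this question.

    # Check whether every row has the expected number of tab-separated fields.
    table(count.fields("dataset.tsv", sep = "\t", quote = "", comment.char = ""))

    # Inspect the lines just after the last one that was read, as raw bytes,
    # so control characters (1a = Ctrl-Z/SUB, 00 = NUL) stand out.
    suspect <- readLines("dataset.tsv", n = 835876)
    lapply(suspect[835872:835876], charToRaw)

If count.fields itself stops early at the same row, that points at an embedded control character rather than a column-count problem.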

  • Is there anything strange about that particular row? Do the read-in lines in R look as they should? – Thomas Oct 20 '13 at 21:18
  • @Thomas: The last line fetched is `835873 user_000033 2007-05-24T19:50:25Z ~8+ ŤÄ`. – Edgar Derby Oct 20 '13 at 21:24
  • I'm assuming that is not what it's supposed to look like? If that's the case, play around with the `fileEncoding` parameter in `read.table`. – Thomas Oct 20 '13 at 21:27
  • @Thomas: I tried different encodings but none of them works... :( – Edgar Derby Oct 20 '13 at 21:30
  • 1
    You probably have an embedded Ctrl-Z which is causing R to abort the read prematurely. The easiest thing to do may be to edit the file to remove it. – Hong Ooi Oct 20 '13 at 21:50
  • Try creating a file with just the last line successfully read and then the next few lines after that, and try to read that. – mrip Oct 20 '13 at 23:03
  • Did you check the encoding of the file? – wush978 Oct 21 '13 at 07:42
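
Following up on Hong Ooi's suggestion above, one way to test the Ctrl-Z theory is to strip the SUB byte (0x1a) and re-read a cleaned copy. The snippet below is only a sketch: it loads the entire file into memory, so for a file this size a streaming tool (e.g. tr -d '\032' in a shell) may be more practical.

    # Sketch: remove embedded Ctrl-Z (SUB, 0x1a) bytes and write a cleaned copy.
    # Note: this reads the whole file into memory at once.
    raw_bytes <- readBin("dataset.tsv", what = "raw",
                         n = file.info("dataset.tsv")$size)
    clean <- raw_bytes[raw_bytes != as.raw(0x1a)]
    writeBin(clean, "dataset_clean.tsv")

    # Re-run the original call against the cleaned copy.
    MyData <- read.table("dataset_clean.tsv", header = FALSE, sep = "\t",
                         col.names = c_names, colClasses = "character",
                         comment.char = "", quote = "", nrows = 19150868)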

0 Answers