
I'd like to open this TSV file: userid-timestamp-artid-artname-traid-traname.tsv (http://www.dtic.upf.edu/~ocelma/MusicRecommendationDataset/lastfm-1K.html)

I know this file contains 19,150,868 rows, but when I read it with R I only get 835K rows.

setwd('C:/xxx/lastfm-dataset-1K.tar/lastfm-dataset-1K')

df <- read.table('userid-timestamp-artid-artname-traid-traname.tsv', header=F, sep='\t', fill=T, quote='')

Sometimes some columns are empty, which is why I'm using fill=T.

I'm pretty sure the problem comes from special characters. The last line fetched is: user_000033 2007-05-24T19:50:25Z ~8+ ŤÄ

I tried several fileEncoding values, but none of them worked.
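Before changing encodings, it can help to inspect the raw lines themselves: count the tab-separated fields and stray quote characters per line to see whether some row near the cutoff is malformed. A minimal Python diagnostic sketch (the sample rows below are made-up stand-ins for the real file):

```python
# Made-up sample rows standing in for the real TSV; row 3 carries a stray
# pair of double quotes of the kind that confuses quote-aware parsers.
sample = (
    "user_000001\t2009-05-04T23:08:57Z\ta1\tDeep Dish\tt1\tFuture of the Future\n"
    "user_000001\t2009-05-04T13:54:10Z\ta2\tSakamoto\tt2\tComposition 0919\n"
    'user_000002\t2009-05-04T13:52:04Z\ta3\tBand\tt3\tSong "Live" Version\n'
)

def field_stats(text):
    """Per line: (line number, tab-separated field count, double-quote count)."""
    return [
        (i, len(line.split("\t")), line.count('"'))
        for i, line in enumerate(text.splitlines(), start=1)
    ]

for stat in field_stats(sample):
    print(stat)
```

Running something like this over the real file around row 835,873 would show whether the last line read has an unusual field count or unbalanced quotes.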

EDIT:

Someone else had the same issue with the exact same file, but no answer was found: read.table only reads the first 835873 rows

I finally did it with Python and it works:

import pandas as pd
import csv

df = pd.read_csv('userid-timestamp-artid-artname-traid-traname.tsv',
                 quoting=csv.QUOTE_NONE, header=None, sep='\t',
                 na_values=[''], error_bad_lines=False)

So the question is: how can I do the same in R? And why do the weird characters cause a problem for R but not for Python?
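One plausible explanation, not confirmed in this thread: the file may contain a stray Ctrl-Z byte (0x1A), which text-mode reads on Windows treat as an end-of-file marker, so read.table would silently stop at that row; Python's csv and pandas readers do not honour that DOS convention. A sketch with a made-up two-row file containing such a byte:

```python
import csv
import os
import tempfile

# Hypothetical two-row TSV whose first row contains a Ctrl-Z (0x1A) byte,
# the old DOS end-of-file marker that Windows text-mode reads stop at.
data = (
    b"u1\t2007-01-01T00:00:00Z\ta1\tName\x1aHere\tt1\tTrack\n"
    b"u2\t2007-01-02T00:00:00Z\ta2\tOther\tt2\tSong\n"
)

path = os.path.join(tempfile.mkdtemp(), "sample.tsv")
with open(path, "wb") as f:
    f.write(data)

# Python reads straight past the 0x1A byte, so both rows survive.
with open(path, newline="", encoding="latin-1") as f:
    rows = list(csv.reader(f, delimiter="\t", quoting=csv.QUOTE_NONE))
print(len(rows))
```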

Stephane
  • I think the best thing might be to store the data in a SQLite database. http://r.789695.n4.nabble.com/sqldf-for-Very-Large-Tab-Delimited-Files-td4350555.html – germcd Jan 24 '15 at 23:42
  • I would recommend `fread` from the `data.table` package for that size of data – mlegge Jan 25 '15 at 00:34
  • @mkemp6 `fread` doesn't work because it can't handle rows with missing columns: http://stackoverflow.com/questions/18597488/fill-option-for-fread – Stephane Jan 25 '15 at 00:38
  • @Stephane do you have an opportunity to preprocess with awk or python? – mlegge Jan 25 '15 at 01:21
  • @mkemp6. Yes, I did it in Python and it works *(see update)*. I don't really understand why the weird characters cause a problem to R and not to Python ! :/ – Stephane Jan 25 '15 at 11:22
  • @Stephane, why not add the extra commas in Python then write out so you can read into R with `fread`? – mlegge Jan 26 '15 at 00:39
  • @mkemp6. This is what I did, but I'd like to understand what's wrong. I'm curious :) – Stephane Jan 26 '15 at 01:14
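The preprocessing route suggested in the comments can be sketched as follows: pad every line to six tab-separated fields before handing the file to `fread` (the sample lines are made up; real code would read from and write back to disk):

```python
def pad_fields(lines, n=6):
    """Pad each tab-separated line to exactly n fields; longer lines are kept."""
    out = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        fields += [""] * (n - len(fields))
        out.append("\t".join(fields))
    return out

# Made-up ragged sample: the second line is missing the last two fields.
ragged = [
    "u1\t2007-01-01T00:00:00Z\ta1\tArtist\tt1\tTrack\n",
    "u2\t2007-01-02T00:00:00Z\ta2\tArtist Two\n",
]
padded = pad_fields(ragged)
print([line.count("\t") for line in padded])  # every line now has 5 tabs
```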

1 Answer


I think SQLite is a good idea, and here is how I was able to implement it.

From a shell:

$ sqlite3 test.db

Then in sqlite:

sqlite> create table test (userid text, timestamp text, artid text, artname text, traid text, traname text);
sqlite> .separator "\t"
sqlite> .import userid-timestamp-artid-artname-traid-traname.tsv test
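The same import can also be scripted rather than typed into the shell; below is a rough Python `sqlite3` equivalent of the dot-commands above, using an in-memory database and two made-up rows in place of the real file:

```python
import sqlite3

# Stand-in rows; the real data would come from splitting each file line on "\t".
rows = [
    ("user_000001", "2009-05-04T23:08:57Z", "a1", "Deep Dish", "t1", "Future of the Future"),
    ("user_000002", "2009-05-04T13:54:10Z", "a2", "Underworld", "t2", "Boro"),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE test (userid TEXT, timestamp TEXT, artid TEXT,"
    " artname TEXT, traid TEXT, traname TEXT)"
)
conn.executemany("INSERT INTO test VALUES (?, ?, ?, ?, ?, ?)", rows)
count = conn.execute("SELECT COUNT(*) FROM test").fetchone()[0]
print(count)
```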

Then back in R:

R> library(RSQLite)
Loading required package: DBI
R> conn <- dbConnect(SQLite(), dbname='test.db')
R> df <- dbGetQuery(conn, "SELECT * FROM test")
R> nrow(df)
[1] 15121996

It's not 19 million, but it's close enough for a difficult dataset.

Stedy
  • Why don't we get all the rows with this solution? – Stephane Jan 25 '15 at 02:27
  • Because there are likely tabs in the song names, which R or SQLite interprets as the start of a new column. You could always use vi or emacs to edit the file and manually add your own field separator. – Stedy Jan 25 '15 at 02:33