
I'd like to open this TSV file: userid-timestamp-artid-artname-traid-traname.tsv (http://www.dtic.upf.edu/~ocelma/MusicRecommendationDataset/lastfm-1K.html)

I know this file contains 19,150,868 rows, but when I read it with R I only get 835K rows.

setwd('C:/xxx/lastfm-dataset-1K.tar/lastfm-dataset-1K')

df <- read.table('userid-timestamp-artid-artname-traid-traname.tsv', header=F, sep='\t', fill=T, quote='')

Sometimes some columns are empty, which is why I'm using fill=T.

I'm pretty sure the problem comes from special characters. The last line fetched is: user_000033 2007-05-24T19:50:25Z ~8+ ŤÄ

I tried several fileEncoding values, but none of them worked.
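Before changing encodings, it can help to inspect the raw lines themselves: count the tab-separated fields and stray quote characters per line to see whether some row near the cutoff is malformed. A minimal Python diagnostic sketch (the sample rows below are made-up stand-ins for the real file):

```python
# Made-up sample rows standing in for the real TSV; row 3 carries a stray
# pair of double quotes of the kind that confuses quote-aware parsers.
sample = (
    "user_000001\t2009-05-04T23:08:57Z\ta1\tDeep Dish\tt1\tFuture of the Future\n"
    "user_000001\t2009-05-04T13:54:10Z\ta2\tSakamoto\tt2\tComposition 0919\n"
    'user_000002\t2009-05-04T13:52:04Z\ta3\tBand\tt3\tSong "Live" Version\n'
)

def field_stats(text):
    """Per line: (line number, tab-separated field count, double-quote count)."""
    return [
        (i, len(line.split("\t")), line.count('"'))
        for i, line in enumerate(text.splitlines(), start=1)
    ]

for stat in field_stats(sample):
    print(stat)
```

Running something like this over the real file around row 835,873 would show whether the last line read has an unusual field count or unbalanced quotes.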

EDIT:

Someone else had the same issue with the exact same file, but no answer was found: read.table only reads the first 835873 rows

I finally did it with Python and it works:

import pandas as pd
import csv

df = pd.read_csv('userid-timestamp-artid-artname-traid-traname.tsv',
                 quoting=csv.QUOTE_NONE, header=None, sep='\t',
                 na_values=[''], error_bad_lines=False)

So the question is: how can I do the same in R? And why do the weird characters cause a problem for R but not for Python?
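One plausible explanation, not confirmed in this thread: the file may contain a stray Ctrl-Z byte (0x1A), which text-mode reads on Windows treat as an end-of-file marker, so read.table would silently stop at that row; Python's csv and pandas readers do not honour that DOS convention. A sketch with a made-up two-row file containing such a byte:

```python
import csv
import os
import tempfile

# Hypothetical two-row TSV whose first row contains a Ctrl-Z (0x1A) byte,
# the old DOS end-of-file marker that Windows text-mode reads stop at.
data = (
    b"u1\t2007-01-01T00:00:00Z\ta1\tName\x1aHere\tt1\tTrack\n"
    b"u2\t2007-01-02T00:00:00Z\ta2\tOther\tt2\tSong\n"
)

path = os.path.join(tempfile.mkdtemp(), "sample.tsv")
with open(path, "wb") as f:
    f.write(data)

# Python reads straight past the 0x1A byte, so both rows survive.
with open(path, newline="", encoding="latin-1") as f:
    rows = list(csv.reader(f, delimiter="\t", quoting=csv.QUOTE_NONE))
print(len(rows))
```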

Stephane
  • I think the best thing might be to store the data in a SQLite database. http://r.789695.n4.nabble.com/sqldf-for-Very-Large-Tab-Delimited-Files-td4350555.html – germcd Jan 24 '15 at 23:42
  • I would recommend `fread` from the `data.table` package for that size of data – mlegge Jan 25 '15 at 00:34
  • @mkemp6 `fread` doesn't work because it can't handle rows with missing columns: http://stackoverflow.com/questions/18597488/fill-option-for-fread – Stephane Jan 25 '15 at 00:38
  • @Stephane do you have an opportunity to preprocess with awk or python? – mlegge Jan 25 '15 at 01:21
  • @mkemp6. Yes, I did it in Python and it works *(see update)*. I don't really understand why the weird characters cause a problem to R and not to Python ! :/ – Stephane Jan 25 '15 at 11:22
  • @Stephane, why not add the extra commas in Python then write out so you can read into R with `fread`? – mlegge Jan 26 '15 at 00:39
  • @mkemp6. This is what I did, but I'd like to understand what's wrong. I'm curious :) – Stephane Jan 26 '15 at 01:14
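The preprocessing route suggested in the comments can be sketched as follows: pad every line to six tab-separated fields before handing the file to `fread` (the sample lines are made up; real code would read from and write back to disk):

```python
def pad_fields(lines, n=6):
    """Pad each tab-separated line to exactly n fields; longer lines are kept."""
    out = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        fields += [""] * (n - len(fields))
        out.append("\t".join(fields))
    return out

# Made-up ragged sample: the second line is missing the last two fields.
ragged = [
    "u1\t2007-01-01T00:00:00Z\ta1\tArtist\tt1\tTrack\n",
    "u2\t2007-01-02T00:00:00Z\ta2\tArtist Two\n",
]
padded = pad_fields(ragged)
print([line.count("\t") for line in padded])  # every line now has 5 tabs
```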

1 Answer


I think SQLite is a good idea, and here is how I was able to implement it.

From a shell:

$ sqlite3 test.db

Then in sqlite:

sqlite> create table test (userid text, timestamp text, artid text, artname text, traid text, traname text);
sqlite> .separator "\t"
sqlite> .import userid-timestamp-artid-artname-traid-traname.tsv test
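The same import can also be scripted rather than typed into the shell; below is a rough Python `sqlite3` equivalent of the dot-commands above, using an in-memory database and two made-up rows in place of the real file:

```python
import sqlite3

# Stand-in rows; the real data would come from splitting each file line on "\t".
rows = [
    ("user_000001", "2009-05-04T23:08:57Z", "a1", "Deep Dish", "t1", "Future of the Future"),
    ("user_000002", "2009-05-04T13:54:10Z", "a2", "Underworld", "t2", "Boro"),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE test (userid TEXT, timestamp TEXT, artid TEXT,"
    " artname TEXT, traid TEXT, traname TEXT)"
)
conn.executemany("INSERT INTO test VALUES (?, ?, ?, ?, ?, ?)", rows)
count = conn.execute("SELECT COUNT(*) FROM test").fetchone()[0]
print(count)
```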

Then back in R:

R> library(RSQLite)
Loading required package: DBI
R> conn <- dbConnect(SQLite(), dbname='test.db')
R> df <- dbGetQuery(conn, "SELECT * FROM test")
R> nrow(df)
[1] 15121996

It's not 19 million, but it's close enough for a difficult dataset.

Stedy
  • Why don't we get all the rows with this solution? – Stephane Jan 25 '15 at 02:27
  • Because there are likely tabs in the song names, which R or SQLite interprets as the start of a new column. You could always use vi or emacs to edit the file and manually add your own field separator. – Stedy Jan 25 '15 at 02:33