I'd like to open this TSV file: userid-timestamp-artid-artname-traid-traname.tsv (http://www.dtic.upf.edu/~ocelma/MusicRecommendationDataset/lastfm-1K.html)
I know this file contains 19,150,868 rows, but when I read it with R I only get about 835K rows.
setwd('C:/xxx/lastfm-dataset-1K.tar/lastfm-dataset-1K')
df <- read.table('userid-timestamp-artid-artname-traid-traname.tsv', header=F, sep='\t', fill=T, quote='')
Some columns are sometimes empty, which is why I'm using fill=T.
I'm pretty sure the problem comes from special characters.
The last line fetched is: user_000033 2007-05-24T19:50:25Z ~8+ ŤÄ
I tried several fileEncoding values, but none of them worked.
EDIT:
Someone else had the same issue with the exact same file, but no answer has been identified: "read.table only reads the first 835873 rows"
I finally did it with Python, and it works:
import pandas as pd
import csv
df = pd.read_csv('userid-timestamp-artid-artname-traid-traname.tsv', quoting=csv.QUOTE_NONE, header=None, sep='\t', na_values=[''], error_bad_lines=False)
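To narrow down which special character might be involved, here is a minimal sketch that scans the file in binary mode and reports the lines containing suspicious control bytes. The function name `find_control_bytes` and the choice of suspects (NUL and 0x1A, the Ctrl-Z byte that text-mode reads on Windows can treat as end-of-file) are my own assumptions; I have not confirmed that either byte is actually present in this file.

```python
def find_control_bytes(path, suspects=(b"\x00", b"\x1a")):
    """Scan a file in binary mode and report lines containing suspect bytes.

    Returns (total_lines, hits), where hits is a list of
    (line_number, suspect_byte) tuples with 1-based line numbers.
    The default suspects are an assumption, not a confirmed diagnosis.
    """
    total = 0
    hits = []
    # Binary mode: no decoding and no end-of-file interpretation of 0x1A.
    with open(path, "rb") as fh:
        for total, line in enumerate(fh, start=1):
            for suspect in suspects:
                if suspect in line:
                    hits.append((total, suspect))
    return total, hits


# Hypothetical usage on the real file:
# total, hits = find_control_bytes("userid-timestamp-artid-artname-traid-traname.tsv")
# print(total, hits[:5])
```

If the first reported line number turns out to be close to the row where read.table stops, that would point to the control byte rather than to the file encoding.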
So the question is: how can I do the same with R? Why do the weird characters cause a problem for R and not for Python?