
I have a tab-delimited file saved as a .txt, with double quotes around the string variables. The file can be found here.

I am trying to read it into SparkR (version 3.1.2), but cannot successfully bring it into the environment. I've tried variations of read.df, like this:

df <- read.df(path = "FILE.txt", header="True", inferSchema="True", delimiter = "\t", encoding="ISO-8859-15")

df <- read.df(path = "FILE.txt", source = "txt", header="True", inferSchema="True", delimiter = "\t", encoding="ISO-8859-15")
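
(For reference: Spark has no built-in "txt" source, so a tab-delimited file would normally go through the "csv" source with a delimiter option. A minimal sketch of that variant, with the same placeholder path:)

df <- read.df(path = "FILE.txt",
              source = "csv",            # Spark's CSV reader handles arbitrary delimiters
              header = "true",
              inferSchema = "true",
              delimiter = "\t",          # "delimiter" is an alias for the "sep" option
              encoding = "ISO-8859-15")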

I have had success bringing in CSVs with read.csv, but many of the files I have are over 10GB, and it is not practical to convert them to CSV before bringing them into SparkR.

EDIT: When I run read.df I get a laundry list of errors, starting with this:

[screenshot of the error output]

I am able to bring in CSV files used in a previous project with both read.df and read.csv, so I don't think it's a Java issue.

Obie K

1 Answer


If you don't need to use SparkR specifically, then base R's read.table should work just fine for the .txt you provided. Note that it is tab-delimited, so the separator needs to be specified.

Something like this should work:

dat <- read.table("FILE.txt",
                  sep = "\t",      # the file is tab-delimited
                  header = TRUE)   # first row holds the column names
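
If you do still want the data in SparkR afterward, the base-R data frame can be handed straight to a Spark session. A minimal sketch (assumes SparkR is installed and a session can be started):

library(SparkR)
sparkR.session()          # start (or attach to) a Spark session
sdf <- as.DataFrame(dat)  # promote the base-R data frame to a SparkDataFrame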
jsizzle
  • It does not. I was hoping the link to the .txt file would be helpful. Is there anything I can do to make it clearer? – Obie K Jul 13 '21 at 19:34
  • See update, I think you can easily solve this with read.table if you don't need to specifically use R-Spark. – jsizzle Jul 13 '21 at 20:22
  • Note that the above avoids the read-in error you shared, and you can of course feed the dat object into R-Spark as a df; but if the issue is loading it into memory, then this does not help. – jsizzle Jul 13 '21 at 20:30
  • Thanks! The issue is that a few of my files are absolutely massive - around 15GB or so. I thought that SparkR would be the best way to bring them in since running ```read.table``` never finishes. Is there a way for ```read.table``` to handle such large files? – Obie K Jul 13 '21 at 23:37
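
For files in the 10-15GB range, one commonly suggested alternative is data.table::fread, which is typically much faster than read.table on large delimited files. A minimal sketch (assumes the data.table package is installed and the file fits in RAM):

library(data.table)
# sep and quote mirror the file's tab delimiter and quoted strings
dat <- fread("FILE.txt", sep = "\t", header = TRUE, quote = "\"")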