Does anyone know how to read a text file in SparkR version 1.4.0? Are there any Spark packages available for that?
3 Answers
Spark 1.6+
You can use the text input format to read a text file as a DataFrame:
read.df(sqlContext=sqlContext, source="text", path="README.md")
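For context, a minimal sketch of the full call sequence (the local master setting and the README.md path are just placeholders, assuming a local Spark 1.6 installation):
library(SparkR)
# initialize Spark and SQL contexts
sc <- sparkR.init(master = "local[*]")
sqlContext <- sparkRSQL.init(sc)
# the text source yields a DataFrame with a single string column named "value"
lines <- read.df(sqlContext, source = "text", path = "README.md")
printSchema(lines)
head(lines, 3)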
Spark <= 1.5
Short answer is you don't. SparkR 1.4 has been almost completely stripped of its low-level API, leaving only a limited subset of Data Frame operations. As you can read on an old SparkR webpage:
As of April 2015, SparkR has been officially merged into Apache Spark and is shipping in an upcoming release (1.4). (...) Initial support for Spark in R will be focussed on high level operations instead of low level ETL.
Probably the closest thing is to load text files using spark-csv:
> df <- read.df(sqlContext, "README.md", source = "com.databricks.spark.csv")
> showDF(limit(df, 5))
+--------------------+
| C0|
+--------------------+
| # Apache Spark|
|Spark is a fast a...|
|high-level APIs i...|
|supports general ...|
|rich set of highe...|
+--------------------+
Since typical RDD operations like map, flatMap, reduce or filter are gone as well, this is probably what you want anyway.
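For example, something like the per-line character count computed below with the internal API can be expressed with DataFrame operations instead. A sketch only, assuming the df loaded above with its default C0 column:
# built-in SQL length() instead of map/nchar
lens <- selectExpr(df, "length(C0) AS len")
showDF(limit(lens, 3))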
Now, the low-level API is still there underneath, so you can always do something like below, but I doubt it is a good idea. SparkR developers most likely had a good reason to make it private. To quote the ::: man page:
It is typically a design mistake to use ‘:::’ in your code since the corresponding object has probably been kept internal for a good reason. Consider contacting the package maintainer if you feel the need to access the object for anything but mere inspection.
Even if you're willing to ignore good coding practices, it is most likely not worth the time. The pre-1.4 low-level API is embarrassingly slow and clumsy, and without all the goodness of the Catalyst optimizer the same most likely holds for the internal 1.4 API.
> rdd <- SparkR:::textFile(sc, 'README.md')
> counts <- SparkR:::map(rdd, nchar)
> SparkR:::take(counts, 3)
[[1]]
[1] 14
[[2]]
[1] 0
[[3]]
[1] 78
Note that spark-csv, unlike textFile, ignores empty lines.

Please follow the link http://ampcamp.berkeley.edu/5/exercises/sparkr.html
We can simply use:
textFile <- textFile(sc, "/home/cloudera/SparkR-pkg/README.md")
While checking the SparkR code, Context.R has a textFile method, so ideally a SparkContext should have a textFile API to create an RDD, but that is missing from the docs.
# Create an RDD from a text file.
#
# This function reads a text file from HDFS, a local file system (available on all
# nodes), or any Hadoop-supported file system URI, and creates an
# RDD of strings from it.
#
# @param sc SparkContext to use
# @param path Path of file to read. A vector of multiple paths is allowed.
# @param minPartitions Minimum number of partitions to be created. If NULL, the default
# value is chosen based on available parallelism.
# @return RDD where each item is of type \code{character}
# @export
# @examples
#\dontrun{
# sc <- sparkR.init()
# lines <- textFile(sc, "myfile.txt")
#}
textFile <- function(sc, path, minPartitions = NULL) {
# Allow the user to have a more flexible definiton of the text file path
path <- suppressWarnings(normalizePath(path))
# Convert a string vector of paths to a string containing comma separated paths
path <- paste(path, collapse = ",")
jrdd <- callJMethod(sc, "textFile", path, getMinPartitions(sc, minPartitions))
# jrdd is of type JavaRDD[String]
RDD(jrdd, "string")
}
Follow the link https://github.com/apache/spark/blob/master/R/pkg/R/context.R
For test cases see https://github.com/apache/spark/blob/master/R/pkg/inst/tests/test_rdd.R
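If you want to try that method from SparkR 1.4, it is internal (not exported), so it has to be reached through the ::: operator, just as in the first answer. A sketch only:
sc <- sparkR.init()
# textFile() is defined in Context.R but not exported in SparkR 1.4
rdd <- SparkR:::textFile(sc, "/home/cloudera/SparkR-pkg/README.md")
SparkR:::take(rdd, 3)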

- Thanks for your reply ABC. But I actually meant the SparkR 1.4.0 version. There is no function called textFile() in this version. Check this link https://spark.apache.org/docs/latest/api/R/index.html – Edwin Vivek N Jul 01 '15 at 09:52
- If you check through the code https://github.com/apache/spark/blob/master/R/pkg/R/RDD.R, the comment says an RDD can be created using textFile, but I can't see any textFile code there; on checking Context.R, I do find the method https://github.com/apache/spark/blob/master/R/pkg/R/context.R – Abhishek Choudhary Jul 01 '15 at 10:22
- Yes, but my question is how to call that method from the Spark 1.4.0 API. The latest version doesn't have that method. If you have read a text file using **SparkR 1.4.0** successfully, please share your code. – Edwin Vivek N Jul 01 '15 at 10:49
In fact, you could use the databricks/spark-csv package to handle tsv files too.
For example,
data <- read.df(sqlContext, "<path_to_tsv_file>", source = "com.databricks.spark.csv", delimiter = "\t")
A host of options is provided here - databricks-spark-csv#features
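For instance, a header row and schema inference can be requested as well (a sketch; header and inferSchema are options documented by databricks/spark-csv, and the path is still a placeholder):
data <- read.df(sqlContext, "<path_to_tsv_file>",
                source = "com.databricks.spark.csv",
                delimiter = "\t", header = "true", inferSchema = "true")
printSchema(data)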
