
I am working with the Spark Scala shell and trying to create DataFrames and Datasets from a text file.

For reading a text file into a Dataset, there are two options, the text and textFile methods, as shown by tab completion:

scala> spark.read.
csv   format   jdbc   json   load   option   options   orc   parquet   schema   table   text   textFile

Here is how I am getting a DataFrame and a Dataset from these two methods:

scala> val df = spark.read.text("/Users/karanverma/Documents/logs1.txt")
df: org.apache.spark.sql.DataFrame = [value: string]

scala> val df = spark.read.textFile("/Users/karanverma/Documents/logs1.txt")
df: org.apache.spark.sql.Dataset[String] = [value: string]

So my question is: what is the difference between the two methods for a text file?

When to use which methods?

KayV
    I would recommend you read the [**Scaladoc**](https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.DataFrameReader), where the difference is explicitly stated. Apart from the return type _(`DataFrame` vs `Dataset[String]`)_, `text` will create an additional column for each partition already present in the path, while `textFile` will ignore all partitions and simply load the file line by line. – Luis Miguel Mejía Suárez Mar 28 '19 at 13:45
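To illustrate the partition-column point from the comment above, here is a sketch assuming a hypothetical directory with a Hive-style partition (the `/tmp/logs` path and `year=2019` layout are made up for illustration):

```scala
scala> // Hypothetical layout: /tmp/logs/year=2019/part.txt

scala> val df = spark.read.text("/tmp/logs")      // partition discovery adds a column
df: org.apache.spark.sql.DataFrame = [value: string, year: int]

scala> val ds = spark.read.textFile("/tmp/logs")  // lines only; Dataset[String] has no room for extra columns
ds: org.apache.spark.sql.Dataset[String] = [value: string]
```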

1 Answer


As I've noticed, the two methods have almost the same functionality; the difference is in the return type.

spark.read.text loads the data into a DataFrame (an alias for Dataset[Row]), an untyped distributed collection organized into named columns — here a single value column of type string. spark.read.textFile loads the data into a Dataset[String], a strongly typed collection in which each element is one line of the file.

Use textFile when you want typed, functional transformations (map, filter, flatMap) over the lines; use text when you want to stay in the untyped DataFrame/SQL API.
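A minimal sketch of how the two return types are used differently (the file path is taken from the question; the transformations themselves are illustrative):

```scala
scala> val ds = spark.read.textFile("/Users/karanverma/Documents/logs1.txt")
scala> ds.map(_.length)                // typed API: each element is a String
scala> ds.filter(_.contains("ERROR"))  // ordinary Scala predicates on the lines

scala> val df = spark.read.text("/Users/karanverma/Documents/logs1.txt")
scala> df.select(length($"value"))     // untyped API: work through Column expressions
scala> df.filter($"value".contains("ERROR"))
```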

Hope it helps.

Rex