
I am only trying to read a text file into a PySpark RDD, and I am noticing huge differences between sqlContext.read.load and sqlContext.read.text.

s3_single_file_inpath = 's3a://bucket-name/file_name'

# Fails (see the error below):
indata = sqlContext.read.load(s3_single_file_inpath, format='com.databricks.spark.csv', header='true', inferSchema='false', sep=',')
# Succeeds:
indata = sqlContext.read.text(s3_single_file_inpath)

The sqlContext.read.load command above fails with:

Py4JJavaError: An error occurred while calling o227.load.
: java.lang.ClassNotFoundException: Failed to find data source: com.databricks.spark.csv. Please find packages at http://spark-packages.org

But the second one succeeds?

Now, I am confused by this because all of the resources I see online say to use sqlContext.read.load, including this one: https://spark.apache.org/docs/1.6.1/sql-programming-guide.html.

It is not clear to me when to use which of these. Is there a clear distinction between them?

makansij

2 Answers


Why is there a difference between sqlContext.read.load and sqlContext.read.text?

sqlContext.read.load assumes parquet as the data source format, while sqlContext.read.text assumes the text format.
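With no format given, the two therefore behave quite differently. A minimal sketch (the paths are placeholders):

df = sqlContext.read.load('/path/to/data')   # expects Parquet files by default (spark.sql.sources.default)
df = sqlContext.read.text('/path/to/data')   # reads lines into a single string column named 'value'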

With sqlContext.read.load you can define the data source format using the format parameter.
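For example, in Spark 2.x, where csv is a built-in source, these calls are equivalent (reusing the path variable from the question):

df = sqlContext.read.load(s3_single_file_inpath, format='csv', header='true')
df = sqlContext.read.format('csv').option('header', 'true').load(s3_single_file_inpath)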


Depending on the Spark version (1.6 vs 2.x), you may or may not have to load an external Spark package to get support for the csv format.

As of Spark 2.0 you no longer have to load the spark-csv package, since (quoting its official documentation):

NOTE: This functionality has been inlined in Apache Spark 2.x. This package is in maintenance mode and we only accept critical bug fixes.

That would explain the confusion: you may have been using Spark 1.6.x without having loaded the Spark package that provides csv support.

Now, I am confused by this because all of the resources I see online say to use sqlContext.read.load, including this one: https://spark.apache.org/docs/1.6.1/sql-programming-guide.html.

https://spark.apache.org/docs/1.6.1/sql-programming-guide.html is for Spark 1.6.1, when the spark-csv package was not yet part of Spark; that only happened in Spark 2.0.


It is not clear to me when to use which of these. Is there a clear distinction between them?

There is none, actually, if you use Spark 2.x.

If, however, you use Spark 1.6.x, spark-csv has to be loaded separately using the --packages option (as described in Using with Spark shell):

This package can be added to Spark using the --packages command line option. For example, to include it when starting the spark shell:
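A typical invocation looks like this (the version numbers are illustrative; pick the Scala suffix that matches your Spark build):

pyspark --packages com.databricks:spark-csv_2.10:1.5.0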


As a matter of fact, you can still use the com.databricks.spark.csv format explicitly in Spark 2.x, as it's recognized internally.
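A minimal sketch of that, reusing the path from the question:

# Spark 2.x maps the old com.databricks.spark.csv name to the built-in csv source.
indata = sqlContext.read.format('com.databricks.spark.csv').option('header', 'true').load(s3_single_file_inpath)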

Jacek Laskowski
  • so you mean that irrespective of whether you use `spark.read.csv()`/`spark.read.text()` or `spark.read.load()`, it is one and the same thing and has no distinction for Spark 2.x ? – cph_sto Nov 19 '18 at 13:59
  • Pretty much yes. There are some optimizations that data sources can use to make the loading more effective (e.g. schema inference), but that's just a minor point in the discussion. – Jacek Laskowski Nov 19 '18 at 14:49
  • Oh that's very reassuring. Many thanks Jacek. If I may ask you another question - while using these import functions, can we specify the `numberOfPartitions`? Or is it that we have to resort to `repartition()` afterwards? If you want, I can post it as another question and intimate you so that you could answer it there. Please let me know. – cph_sto Nov 19 '18 at 14:57
  • Number of partitions? No. It's orthogonal to a data source so it does not have to deal with such low-level things like partitions. – Jacek Laskowski Nov 19 '18 at 16:47
  • Well, your reasoning is quite high level for me to understand at the moment :) I will investigate it further. Thank you so much Jacek. – cph_sto Nov 19 '18 at 19:30
  • Hi Jacek, would you mind answering it - https://stackoverflow.com/questions/53431989/pyspark-partitioning-and-hashing-multiple-dataframes-then-joining your perspective will be very valuable. – cph_sto Nov 22 '18 at 15:29

The difference is:

  • text is a built-in input format in Spark 1.6
  • com.databricks.spark.csv is a third-party package in Spark 1.6

To use the third-party Spark CSV package (no longer needed in Spark 2.0) you have to follow the instructions on the spark-csv site, for example by providing the

 --packages com.databricks:spark-csv_2.10:1.5.0  

argument to the spark-submit / pyspark commands.

Beyond that, sqlContext.read.formatName(...) is syntactic sugar for sqlContext.read.format("formatName").load(...) and sqlContext.read.load(..., format="formatName").
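For example, with the built-in text source, these three calls are interchangeable (a sketch using the path from the question):

indata = sqlContext.read.text(s3_single_file_inpath)
indata = sqlContext.read.format('text').load(s3_single_file_inpath)
indata = sqlContext.read.load(s3_single_file_inpath, format='text')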

Alper t. Turker