27

Neither the developer documentation nor the API documentation includes any reference to which options can be passed to DataFrame.saveAsTable or DataFrameWriter.options, or how they would affect the saving of a Hive table.
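For concreteness, the kind of call in question looks like this (a minimal Scala sketch; "someKey" is a placeholder, not a real option):

// Which keys are actually recognized by option()/options() when saving a Hive table?
df.write
  .mode("overwrite")
  .option("someKey", "someValue")  // placeholder: which keys are honored here?
  .saveAsTable("my_db.my_table")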

My hope is that in the answers to this question we can aggregate information that would be helpful to Spark developers who want more control over how Spark saves tables and, perhaps, provide a foundation for improving Spark's documentation.

Sim
  • 13,147
  • 9
  • 66
  • 95
  • This is a bit vague and open-ended for SO, though. What are you looking for? Just how to save to hive? – Justin Pihony Jul 18 '15 at 02:48
  • @JustinPihony I see how someone could misread the title. I updated it to make it more explicit. Thanks for your comment. To be clear, the question is not about how to save a Hive table. It's about the undocumented options that can be passed when saving a Hive table. – Sim Jul 18 '15 at 03:23
  • All the options that are available for DataFrameWriter can be passed as options, for example format, mode, partitionBy, etc. Btw, which option are you looking for? – hayat Aug 26 '19 at 06:27

6 Answers

6

The reason you don't see options documented anywhere is that they are format-specific and developers can keep creating custom write formats with a new set of options.
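For example, each format recognizes its own keys (a minimal Scala sketch; the keys shown are standard CSV and Parquet data source options, not an exhaustive list):

// CSV-specific options
df.write.format("csv")
  .option("header", "true")
  .option("sep", "|")
  .save("/tmp/out_csv")

// Parquet-specific options
df.write.format("parquet")
  .option("compression", "snappy")
  .save("/tmp/out_parquet")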

However, for a few supported formats I have listed the options as mentioned in the Spark code itself:

Ashvjit Singh
  • 415
  • 4
  • 10
4

Take a look at https://github.com/delta-io/delta/blob/master/src/main/scala/org/apache/spark/sql/delta/DeltaOptions.scala, the class "DeltaOptions".

Currently, the supported options are (a usage sketch follows the list):

  • replaceWhere
  • mergeSchema
  • overwriteSchema
  • maxFilesPerTrigger
  • excludeRegex
  • ignoreFileDeletion
  • ignoreChanges
  • ignoreDeletes
  • optimizeWrite
  • dataChange
  • queryName
  • checkpointLocation
  • path
  • timestampAsOf
  • versionAsOf
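A minimal usage sketch (assuming the Delta Lake package is on the classpath and a DataFrame df partitioned by a date column; paths are illustrative):

// Overwrite only the partitions matched by the predicate
df.write
  .format("delta")
  .mode("overwrite")
  .option("replaceWhere", "date >= '2020-01-01' AND date < '2020-02-01'")
  .save("/tmp/delta/events")

// timestampAsOf / versionAsOf are read-side (time travel) options
val snapshot = spark.read
  .format("delta")
  .option("versionAsOf", 0)
  .load("/tmp/delta/events")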
Alexander Zwitbaum
  • 4,776
  • 4
  • 48
  • 55
  • 1
    Good to bring the Delta options into this, as Delta Lake popularity grows. – Sim Feb 23 '20 at 02:07
  • 1
    New link: https://github.com/delta-io/delta/blob/master/core/src/main/scala/org/apache/spark/sql/delta/DeltaOptions.scala – Melkor.cz Jan 25 '22 at 12:20
0

According to the source code you can specify the path option (it indicates where to store the Hive external data in HDFS, and is translated to LOCATION in Hive DDL). I'm not sure there are other options associated with saveAsTable, but I'll keep searching for more.
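A minimal sketch of that behavior (Scala; the path is illustrative): passing "path" makes saveAsTable create an external table at that location, which shows up as LOCATION in the Hive DDL.

df.write
  .mode("overwrite")
  .option("path", "hdfs:///warehouse/external/my_table")  // becomes LOCATION in Hive DDL
  .saveAsTable("my_db.my_table")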

baitmbarek
  • 2,440
  • 4
  • 18
  • 26
0

As per the latest Spark documentation, the following are the options that can be passed while writing a DataFrame to external storage using the .saveAsTable(name, format=None, mode=None, partitionBy=None, **options) API.

If you click on the source hyperlink on the right-hand side of the documentation, you can traverse the code and find details of the other, less clear arguments, e.g. format and options, which are described under the DataFrameWriter class.

So when the documentation reads options – all other string options, it is referring to options, which gives you the following option as of Spark 2.4.4:

timeZone: sets the string that indicates a timezone to be used to format timestamps in the JSON/CSV datasources or partition values. If it isn’t set, it uses the default value, session local timezone.

And when it reads format – the format used to save, it is referring to format(source):

Specifies the underlying output data source.

Parameters

source – string, the name of the data source, e.g. ‘json’, ‘parquet’.
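Putting the two together (a minimal Scala sketch; the table and column names are made up): format() selects the data source and option() carries the "other string options" such as timeZone.

df.write
  .format("json")                // underlying output data source
  .mode("overwrite")
  .partitionBy("event_date")
  .option("timeZone", "UTC")     // used for timestamps / partition values
  .saveAsTable("analytics.events_json")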

hope this was helpful.

pprasad009
  • 508
  • 6
  • 9
  • These are just the options of one method of the Python API; there are many more. – Sim Dec 11 '19 at 16:54
-3

The difference is between Spark versions.

We have the following in Spark 2:

createOrReplaceTempView()
createTempView()
createOrReplaceGlobalTempView()
createGlobalTempView()

saveAsTable is deprecated in Spark 2.

Basically, these are divided depending on the availability of the table. Please refer to the link.
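For reference, a minimal sketch of the view-registration methods listed above (global temp views live in the global_temp database):

df.createOrReplaceTempView("events_tmp")
spark.sql("SELECT count(*) FROM events_tmp")

df.createGlobalTempView("events_global")
spark.sql("SELECT count(*) FROM global_temp.events_global")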

sk79
  • 35
  • 10
  • 1
    The question seems to be focused on what options can be passed to tables such as these, not on all the methods that can be used to register a temp table (view) – SilviuC May 09 '19 at 22:31
-3

saveAsTable(String tableName)

Saves the content of the DataFrame as the specified table.

FYI -> https://spark.apache.org/docs/2.3.0/api/java/org/apache/spark/sql/DataFrameWriter.html

chandan gupta
  • 1,185
  • 1
  • 10
  • 7
  • The OP specifically wanted to know the various options available as part of the options method(s) of the DataFrameWriter class. This def specifically: https://spark.apache.org/docs/2.3.0/api/java/org/apache/spark/sql/DataFrameWriter.html#options-scala.collection.Map- – Vivek Sethi Jul 03 '19 at 07:15