
I use Spark 2.0.2.

While learning how to write a Dataset to a Hive table, I understand there are two ways to do it:

  1. using sparkSession.sql("your sql query")
  2. dataframe.write.mode(SaveMode.&lt;mode&gt;).insertInto("tableName")

Could anyone tell me which is the preferred way of loading a Hive table using Spark?
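For reference, a minimal sketch of the two approaches (the database, table, and column names here are made up for illustration):

    import org.apache.spark.sql.{SaveMode, SparkSession}

    val spark = SparkSession.builder()
      .appName("hive-write-example")
      .enableHiveSupport()
      .getOrCreate()

    // 1. Plain SQL executed through the session
    spark.sql("INSERT INTO TABLE my_db.target_table SELECT id, name FROM my_db.staging_table")

    // 2. DataFrame writer API
    val df = spark.table("my_db.staging_table")
    df.write.mode(SaveMode.Append).insertInto("my_db.target_table")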

Jacek Laskowski

2 Answers


In general I prefer option 2: first, because for many rows you cannot realistically build such a long SQL statement, and second, because it reduces the chance of errors and other issues such as SQL injection attacks.

In the same way, with JDBC I use PreparedStatements as much as possible.
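A rough sketch of the difference (the User case class and table name are invented for the example):

    import org.apache.spark.sql.{SaveMode, SparkSession}

    val spark = SparkSession.builder()
      .appName("insert-styles")
      .enableHiveSupport()
      .getOrCreate()
    import spark.implicits._

    case class User(id: Int, name: String)
    val users = Seq(User(1, "alice"), User(2, "o'brien"))

    // Building the statement by hand: every value must be escaped correctly,
    // and the string grows with the number of rows.
    val values = users.map(u => s"(${u.id}, '${u.name}')").mkString(", ")
    spark.sql(s"INSERT INTO TABLE my_db.users VALUES $values")   // the unescaped quote in "o'brien" breaks this

    // The writer API takes the data as-is, with no string assembly,
    // much like binding parameters in a JDBC PreparedStatement.
    users.toDF().write.mode(SaveMode.Append).insertInto("my_db.users")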


Think of it this way: suppose we need to apply daily updates to a Hive table.

This can be achieved in two ways:

  1. Process all the data in the Hive table.
  2. Process only the affected partitions.

For the first option, SQL works like a gem, but keep in mind that the data volume should be small enough that reprocessing the entire table is feasible.
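For example, a full recompute via SQL might look like this (the table and column names are placeholders):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("daily-refresh")
      .enableHiveSupport()
      .getOrCreate()

    // Full reprocess: recompute the whole table and overwrite it in one statement.
    // Fine while the data volume is modest; the cost grows with the entire table.
    spark.sql(
      """INSERT OVERWRITE TABLE my_db.daily_summary
        |SELECT customer_id, SUM(amount) AS total_amount
        |FROM my_db.transactions
        |GROUP BY customer_id""".stripMargin)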

The second option works well if you want to process only the affected partitions: overwrite just those partitions (along the lines of data.write ... partitionBy ... to the table path), and write your logic so that only the affected partitions are reprocessed; see the sketch below. This approach is meant for tables holding millions to billions of records.
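A sketch of the partition-level variant, continuing with the same spark session. The table layout, path, and date are made up; in Spark 2.0.x, writing directly to the partition directory is one way to replace a single partition without touching the rest of the table:

    import org.apache.spark.sql.SaveMode

    // spark: the Hive-enabled SparkSession from the previous sketch.
    // Only yesterday's slice is recomputed and rewritten; every other
    // partition directory of the (potentially huge) table is left untouched.
    val day = "2016-11-20"
    val updated = spark.table("my_db.staging_events").where(s"event_date = '$day'")

    updated
      .drop("event_date")                   // the partition column is encoded in the path
      .write
      .mode(SaveMode.Overwrite)
      .parquet(s"hdfs:///warehouse/events/event_date=$day")

    // Register the partition with the metastore if it did not exist before.
    spark.sql(s"ALTER TABLE my_db.events ADD IF NOT EXISTS PARTITION (event_date = '$day')")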

loneStar