
Our cluster has Spark 1.3 and Hive. There is a large Hive table that I need to add randomly selected rows to. There is a smaller table that I read and check a condition against; if that condition is true, I grab the variables I need in order to query for the random rows to fill in. What I did was query on that condition, `table.where(value < number)`, then turn the result into an array with `take(num_rows)`. Since all of these rows contain the information I need about which random rows are needed from the large Hive table, I iterate through the array.

When I do the query I use `ORDER BY RAND()` (using `sqlContext`). I created a `var` for the new Hive table (so it is mutable), with a column added from the larger table. In the loop, I do a union on each iteration: `newHiveTable = newHiveTable.unionAll(random_rows)`.

I have tried many different ways to do this, but am not sure what is the best way to avoid heavy CPU and temp-disk use. I know that DataFrames aren't intended for incremental adds. One thing I have thought of trying now is to create a CSV file, write the random rows to that file incrementally in the loop, and then, when the loop is finished, load the CSV file as a table and do one `unionAll` to get my final table.
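To make the approach concrete, here is roughly what my loop looks like (a simplified sketch; `small_table`, `big_table`, `group_key`, and the limits are placeholders, not my real names):

```scala
import org.apache.spark.sql.DataFrame

// rows of the small table that satisfy the condition
val conditionRows = sqlContext.table("small_table").filter("value < 10").take(100)

// start from an empty DataFrame that has the target schema
var newHiveTable: DataFrame = sqlContext.sql("SELECT * FROM big_table LIMIT 0")

for (row <- conditionRows) {
  val key = row.getString(0) // the variable that says which random rows to fetch
  val randomRows = sqlContext.sql(
    s"SELECT * FROM big_table WHERE group_key = '$key' ORDER BY RAND() LIMIT 5")
  // every unionAll deepens the lineage, which is where the CPU/temp-disk cost comes from
  newHiveTable = newHiveTable.unionAll(randomRows)
}
```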

Any feedback would be great. Thanks

KBA
  • do you have the choice to move to a recent Spark version? Then you could do: `yourDataFrame.write.mode(SaveMode.Append).saveAsTable("YourTableName")` – user1314742 Apr 26 '16 at 14:45
  • It will not be until next month that we upgrade to Spark 1.5 – KBA Apr 26 '16 at 15:40

1 Answer


I would recommend that you create an external table in Hive, defining the location, and then let Spark write the output as CSV to that directory:

in Hive:

create external table test(key string, value string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ';'
LOCATION '/SOME/HDFS/LOCATION'

And then from Spark, with the aid of https://github.com/databricks/spark-csv , write the DataFrame to CSV files in append mode so that new files are added alongside the existing ones:

df.write.format("com.databricks.spark.csv").mode(SaveMode.Append).save("/SOME/HDFS/LOCATION/")
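With that external table in place, the loop from the question can append each batch of random rows directly to the table's directory instead of accumulating a `unionAll` chain. A sketch, where the table names, `group_key`, and the limits are placeholders:

```scala
import org.apache.spark.sql.SaveMode

val conditionRows = sqlContext.table("small_table").filter("value < 10").take(100)

for (row <- conditionRows) {
  val key = row.getString(0)
  val randomRows = sqlContext.sql(
    s"SELECT key, value FROM big_table WHERE group_key = '$key' ORDER BY RAND() LIMIT 5")
  randomRows.write
    .format("com.databricks.spark.csv")
    .option("delimiter", ";") // match the FIELDS TERMINATED BY of the external table
    .mode(SaveMode.Append)
    .save("/SOME/HDFS/LOCATION/")
}
// Hive now picks up the new files under the table location automatically
```

Note that `df.write` is the Spark 1.4+ `DataFrameWriter` API; on Spark 1.3 the equivalent call is `randomRows.save("/SOME/HDFS/LOCATION/", "com.databricks.spark.csv", SaveMode.Append)`.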
user1314742
  • Thanks a lot, that helped. I copied the schema of the table I want to add to using `LIKE tablename` in Hive. Note for anyone out there: due to a firewall, I had to download the Databricks and Apache Commons jar files and add them on the command line when I did spark-submit: `spark-submit --master yarn-client --jars spark-csv_2.10-1.4.0.jar,commons-csv-1.2.jar --class Main main.jar`. Right now the files are being written on each iteration. What would you suggest for loading them? In the Scala, add them to a Hive table in memory and unionAll with the original table I want to add to, then save that? – KBA Apr 27 '16 at 03:59
  • Note 2 for anyone out there: I also had to add a line to my sbt file, inside `libraryDependencies ++= Seq(`: `"com.databricks" %% "spark-csv" % "1.4.0"` – KBA Apr 27 '16 at 04:00
  • Sorry, I could not understand your question "What would you suggest for loading them"; what do you mean? Actually, since they are in the same location as the external table, they are loaded automatically and you do not have to do anything to load them. That's why we create an external table – user1314742 Apr 27 '16 at 09:56
  • The CSV files are written to my HDFS directory inside individual folders. I created the external table in the Hive console, but it is empty. I am missing the part that puts this together. Thanks a lot – KBA Apr 27 '16 at 15:52
  • As mentioned in my answer, you should save them into the same location as your external table, i.e. the one I referred to as `/SOME/HDFS/LOCATION` in my answer. This lets Hive scan all the files located there when reading data from the table. – user1314742 Apr 27 '16 at 16:18
  • Thanks - I did that, that is, used the same HDFS directory both for writing the CSV and as the location of the created external table. Do I need to do anything else for Hive to scan the files? In my HDFS directory it looks like `/tempcsv/data1csv/part0000`, `/tempcsv/dataX.csv/part000`, etc. – KBA Apr 27 '16 at 17:01
  • Yes, you are right, I missed that when saving to files it creates a directory and saves the part files inside. I've edited my answer; please let me know if it is working for you – user1314742 May 01 '16 at 18:22
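For anyone who hits the same nested-directory problem and cannot change where the part files land: by default Hive only reads files directly under the table location, not inside subdirectories. One workaround (assuming a Hive/Hadoop version that supports these standard properties) is to tell Hive to recurse into subdirectories for that session:

```sql
-- run in the Hive session before querying the external table
SET mapred.input.dir.recursive=true;
SET hive.mapred.supports.subdirectories=true;
SELECT COUNT(*) FROM test;
```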