
I have a Spark job that needs to store the last time it ran to a text file. This has to work both on HDFS and on the local file system (for testing).

However, this turns out not to be as straightforward as it looks.

I have tried deleting the directory and getting "can't delete" error messages, and I have tried round-tripping a simple string value through a DataFrame written to Parquet and read back again.

This is all so convoluted that it made me take a step back.

What's the best way to just store a string (timestamp of last execution in my case) to a file by overwriting it?

EDIT:

The nasty way I do it now is as follows:

sqlc.read.parquet(lastExecution).map(t => "" + t(0)).collect()(0)

and

sc.parallelize(List(lastExecution)).repartition(1).toDF().write.mode(SaveMode.Overwrite).save(tsDir)
Havnar
  • When you use the local fs, will it always be with Spark in local mode (i.e. not distributed across a cluster)? Put another way: when you are running on a cluster, will HDFS always be present? – mattinbits Jun 02 '16 at 12:49
  • Yes, local mode is just local mode (run from my IDE). On the cluster we will be using HDFS and YARN. – Havnar Jun 02 '16 at 12:52
  • Have you tried putting the string value into an RDD of one row and writing it using the Spark API? Can you show some code of what you've tried? – mattinbits Jun 02 '16 at 12:54

1 Answer


This sounds like storing simple application/execution metadata. As such, saving the text file shouldn't need to be done by "Spark" (i.e., it shouldn't be done inside distributed Spark jobs, by the workers).

The ideal place to put it is in your driver code, typically after constructing your RDDs. That being said, you wouldn't be using the Spark API for this; you'd do something as trivial as using a writer or a file output stream. The only catch is how you'll read it back. Assuming your driver program always runs on the same machine, there shouldn't be a problem.
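For example, here is a minimal sketch of doing it from the driver with plain Java I/O (the file path and helper names are assumptions, not something from the question):

    import java.io.{File, PrintWriter}
    import scala.io.Source

    val tsFile = new File("/var/myapp/last_execution.txt") // hypothetical driver-local path

    // Overwrites the file on every run: PrintWriter truncates an existing file.
    def writeLastExecution(ts: String): Unit = {
      val out = new PrintWriter(tsFile)
      try out.write(ts) finally out.close()
    }

    // Returns None on the very first run, when the file doesn't exist yet.
    def readLastExecution(): Option[String] =
      if (tsFile.exists()) Some(Source.fromFile(tsFile).mkString.trim) else None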

If this value is to be read by workers in future jobs (which is possibly why you want it in HDFS), and you don't want to use the Hadoop API directly, then you will have to ensure that you have only one partition so that you don't end up with multiple part files holding this trivial value (see the sketch below). This, however, doesn't help with local storage (the file lands on whichever machine happens to run the task), and managing that would simply be going overboard.
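If you do go that Spark route, a hedged sketch along these lines (Spark 1.6+ text data source; the directory name, column name and sqlc identifier are assumptions) keeps everything in a single part file and overwrites the previous run:

    import org.apache.spark.sql.SaveMode
    import sqlc.implicits._                        // assuming sqlc is your SQLContext

    val tsDir = "hdfs:///myapp/lastExecution"      // hypothetical directory

    // One partition -> one part file; Overwrite replaces the whole directory.
    sc.parallelize(Seq(System.currentTimeMillis.toString))
      .toDF("ts")
      .coalesce(1)
      .write
      .mode(SaveMode.Overwrite)
      .text(tsDir)

    // Reading it back on the next run:
    val lastExecution = sqlc.read.text(tsDir).first().getString(0)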

My best suggestion would be to use the driver program and create the file on the machine running the driver (assuming it is the same machine that will be used next time), or, even better, to put it in a database. If this value is needed inside jobs, the driver can simply pass it through.
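And if you do want a single code path that writes the same small file to either file:/// or hdfs:// (as raised in the comments below), here is a hedged sketch using the Hadoop FileSystem API directly; the URIs and function names are assumptions:

    import java.nio.charset.StandardCharsets
    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}
    import scala.io.Source

    // Inside a Spark app you can pass sc.hadoopConfiguration instead of new Configuration().
    def writeTimestamp(uri: String, ts: String, conf: Configuration = new Configuration()): Unit = {
      val path = new Path(uri)            // e.g. "file:///tmp/lastExecution.txt" or "hdfs://namenode/app/lastExecution.txt"
      val fs   = path.getFileSystem(conf)
      val out  = fs.create(path, true)    // overwrite = true, so no explicit delete is needed
      try out.write(ts.getBytes(StandardCharsets.UTF_8)) finally out.close()
    }

    def readTimestamp(uri: String, conf: Configuration = new Configuration()): String = {
      val path = new Path(uri)
      val fs   = path.getFileSystem(conf)
      val in   = fs.open(path)
      try Source.fromInputStream(in, "UTF-8").mkString.trim finally in.close()
    }

Because fs.create(path, true) overwrites in place, this also sidesteps the failing hdfs.delete call mentioned in the comments.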

ernest_k
  • A database would be best, but I don't have access to one. If I could use a file output stream to both HDFS:/// and file:/// that would be great... – Havnar Jun 02 '16 at 12:49
  • It's fairly simple to directly write text to hdfs. Check answers to this question: http://stackoverflow.com/questions/16000840/write-a-file-in-hdfs-with-java – ernest_k Jun 02 '16 at 12:53
  • This is one of the approaches I tried myself. However, the "hdfs.delete( file, true )" call returned "could not delete", and the code then fails because .storeAsText does not have an "overwrite" option. – Havnar Jun 02 '16 at 12:57
  • Do you know why the delete failed? Have you perhaps checked hadoop's logs? – ernest_k Jun 02 '16 at 13:01
  • This error was thrown locally; since it doesn't work there, I don't want to push anything onto the cluster yet. – Havnar Jun 02 '16 at 13:25