I am working with the data files and index files produced by Spark's shuffle mechanism, and I have one question about them: can we merge or combine the two data files (and the two index files) written by two different Spark jobs (on the same RDDs) in Apache Spark?

Any help? Thanks in advance!

info_tech
  • You probably need to clarify this question. There are join and union operations on RDDs, but I don't think you've given us enough information to know which one you want or whether either would work for your use case. – nairbv May 27 '16 at 02:24
  • @Brian Thank you for your reply. I am using Spark to work with refreshed data. My source data is in MySQL tables, which I load into Spark over a JDBC connection. I then filter and map the data, join two RDDs, and call an action on the joined RDD. Next, my program sleeps (thread sleep) for some time, and during that time I delete the data from the original source tables and insert new data. After the sleep, I call the action on the same joined RDD again. As the result, I need the merged data (both join results). – info_tech May 27 '16 at 06:19
  • That seems to be a description of your whole program. I don't see a specific, answerable question in there. – nairbv May 29 '16 at 15:42
  • @Brian The simple question is: is it possible to merge or combine two data files produced by the sort shuffle mechanism in Spark? – info_tech May 31 '16 at 07:42
  • ...and that was your original question in the subject. What do you mean by "combine"? Do you mean join, union, or something else? Why wouldn't you be able to join or union two RDDs regardless of how you had shuffled them? Are you trying to avoid an additional shuffle? What is the actual problem? – nairbv Jun 01 '16 at 14:50
  • sc.textFile("foo.txt").union(sc.textFile("other.txt"))? Maybe you want to throw in a .sortBy somewhere, but I don't know why that matters. – nairbv Jun 01 '16 at 14:52
  • @Brian By "combine" I mean a union of the two data files and the two index files. I do not want to apply a union on the RDDs; I want a union of the files themselves. I am trying to do incremental loading of data. In the first job I have the old data (stored in a data file and an index file). I then delete that data from the original source and insert new data, so the second job has the new data. I want the data from both jobs, which is why I am trying to merge (union) the data files and index files. – info_tech Jun 02 '16 at 09:12
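
To make the file-level "union" described in the last comment concrete, here is a minimal toy sketch in plain Python. It is not Spark code and calls no Spark API. It assumes the usual sort-shuffle layout, in which one data file holds all reduce partitions back to back and one index file holds their cumulative byte offsets as big-endian 8-byte longs. The function names (`write_shuffle_output`, `read_segments`, `merge_shuffle_outputs`) and the sample bytes are hypothetical, and the sketch assumes that segments can simply be concatenated byte-wise, which real serialized or compressed shuffle segments may not allow.

```python
import struct


def write_shuffle_output(partitions):
    """Toy model: pack per-partition byte segments into one data blob
    plus an index blob of cumulative offsets (big-endian 8-byte longs)."""
    data = b"".join(partitions)
    offsets = [0]
    for segment in partitions:
        offsets.append(offsets[-1] + len(segment))
    index = struct.pack(f">{len(offsets)}q", *offsets)
    return data, index


def read_segments(data, index):
    """Split a data blob back into per-partition segments using the index."""
    count = len(index) // 8
    offsets = struct.unpack(f">{count}q", index)
    return [data[offsets[i]:offsets[i + 1]] for i in range(count - 1)]


def merge_shuffle_outputs(first, second):
    """Union two (data, index) pairs partition by partition.

    Assumption (hypothetical simplification): both outputs use the same
    number of partitions and their segments can be concatenated byte-wise.
    """
    segs_a = read_segments(*first)
    segs_b = read_segments(*second)
    if len(segs_a) != len(segs_b):
        raise ValueError("partition counts differ")
    merged = [a + b for a, b in zip(segs_a, segs_b)]
    # The offsets must be recomputed, because every segment length changed.
    return write_shuffle_output(merged)


# Made-up sample data: "old" stands for the first job, "new" for the second.
old = write_shuffle_output([b"a1,a2;", b"", b"c1;"])
new = write_shuffle_output([b"a3;", b"b1;", b""])
data, index = merge_shuffle_outputs(old, new)
print(read_segments(data, index))
```

The point of the sketch is that the two index files cannot just be appended to each other: the data has to be regrouped per partition and every offset recomputed. It also says nothing about whether Spark would accept such a merged file, or about keeping records sorted within a partition.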

0 Answers