
I'm moving data from my Postgres database to Kafka, doing some transformations with Spark in the middle. I have 50 tables, and each table's transformations are completely different from the others'. So I want to know the best way to structure my Spark Structured Streaming code. I'm thinking of three options:

  1. Put all the read and write logic for these 50 tables in one object and call only that object.

  2. Create 50 different objects, one per table, plus a new object with a main method that calls each of the 50 objects and then calls spark.streams.awaitAnyTermination() (sketched below).

  3. Submit each of these 50 objects individually via spark-submit.
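
To make option 2 concrete, here is roughly what I have in mind; the table names, Kafka settings, and the read/transform bodies below are just placeholders:

```scala
// Rough sketch of option 2: one driver starts every table's query and then
// blocks on awaitAnyTermination(). Only the structure matters here.
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.streaming.StreamingQuery

object CustomersStream {
  def start(spark: SparkSession): StreamingQuery = {
    val source: DataFrame = ???        // however the customers change stream is read
    val transformed = source           // customers-specific transformations go here
    transformed.writeStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092")
      .option("topic", "customers")
      .option("checkpointLocation", "/checkpoints/customers")
      .start()
  }
}

// ... one object like this for each of the 50 tables ...

object AllTablesApp {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("postgres-to-kafka").getOrCreate()
    CustomersStream.start(spark)
    // OrdersStream.start(spark), and so on for the other tables
    spark.streams.awaitAnyTermination() // returns when any one of the queries stops
  }
}
```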

If there is another, better option, please let me know.

Thank you

Luan Carvalho

1 Answer


Creating a single object as per your approach 1 does not look good; it will be difficult to understand and maintain.

Between options 2 and 3, I would still prefer the 3rd. Having separate jobs will be a bit of a hassle to maintain (managing deployment and structuring out the common code), but if done well it will give us more flexibility. We could easily undeploy a single table if needed. Also, any subsequent deployments or changes would mean deploying only the concerned table flows; the other existing table pipelines will keep working as they are.
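
As a rough sketch of what structuring out the common code could look like (all names here are illustrative, not taken from your setup), each table job could extend a small shared trait and keep its own main method, so every table stays an independently deployable spark-submit entry point:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.streaming.StreamingQuery

// Shared plumbing lives in one place; only the table-specific parts are abstract.
trait TableToKafkaJob {
  def tableName: String
  def readSource(spark: SparkSession): DataFrame   // how this table's rows arrive
  def transform(df: DataFrame): DataFrame          // this table's own transformations

  def run(): Unit = {
    val spark = SparkSession.builder.appName(s"$tableName-to-kafka").getOrCreate()
    val query: StreamingQuery = transform(readSource(spark))
      .writeStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092") // assumption: broker address
      .option("topic", tableName)
      .option("checkpointLocation", s"/checkpoints/$tableName")
      .start()
    query.awaitTermination()
  }
}

// One small object per table, each with its own main, each submitted as its own job.
object OrdersJob extends TableToKafkaJob {
  val tableName = "orders"
  def readSource(spark: SparkSession): DataFrame = ??? // placeholder for the real source
  def transform(df: DataFrame): DataFrame = df         // orders-specific logic goes here
  def main(args: Array[String]): Unit = run()
}
```

With this shape, changing one table means rebuilding and resubmitting only that table's jar; the other 49 queries keep running untouched.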

Rishabh Sharma
  • Nice, but one question: is it OK to send Spark a lot of Structured Streaming jobs to run in parallel? How can I send these 50 jobs? Do I put all 50 jars in one spark-submit --jars option? I thought of that option, but I really don't know if it is a viable approach – Luan Carvalho Aug 10 '20 at 03:50
  • It is okay to set up multiple Spark Structured Streaming jobs. I would have a deployment script that takes the name of the job to deploy, or an **all** flag which means it will deploy all 50 jobs (sketched below). Deployment of a job would include building the jar, config management, as well as submitting it to the Spark cluster. – Rishabh Sharma Aug 10 '20 at 06:03
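
For illustration, a deployment helper along the lines described in the comment above might look roughly like this; the main classes, jar paths and the YARN master are assumptions, not details from this thread:

```scala
// Hypothetical deploy helper: submit one table's job by name, or all of them.
import scala.sys.process._

object Deploy {
  // One entry per table job: table name -> main class (placeholders; 50 entries in practice).
  val jobs: Map[String, String] = Map(
    "customers" -> "jobs.CustomersJob",
    "orders"    -> "jobs.OrdersJob"
  )

  def submit(name: String, mainClass: String): Int =
    Seq(
      "spark-submit",
      "--class", mainClass,
      "--master", "yarn",          // assumption: a YARN cluster
      "--deploy-mode", "cluster",
      s"target/$name-job.jar"      // assumption: one jar built per table job
    ).!                            // run the command and return its exit code

  def main(args: Array[String]): Unit = args.toList match {
    case "all" :: Nil => jobs.foreach { case (name, mainClass) => submit(name, mainClass) }
    case name :: Nil =>
      jobs.get(name) match {
        case Some(mainClass) => submit(name, mainClass)
        case None            => println(s"unknown job: $name")
      }
    case _ => println("usage: Deploy <table-name> | all")
  }
}
```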