
I have a legacy system that streams records into a queue (Azure Event Hubs) as they are changed and, every 24h, another process reads all records and dumps them into the same stream. This mechanism lets any consumer recreate the data by reading at least the last 24h of the stream.

I'm using Spark to read this stream and recreate a view of the original data (I can't read the source directly, unfortunately). Other Spark jobs, both batch and streaming, will then join against this data.
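
Roughly what the ingest side looks like; a minimal sketch, assuming the record bodies are JSON with a business key `record_id` and an `updated_at` timestamp (both placeholder names) and using the azure-eventhubs-spark connector:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StringType, StructField, StructType, TimestampType

spark = SparkSession.builder.appName("legacy-stream-ingest").getOrCreate()

# Placeholder schema -- the real records carry more business columns.
schema = StructType([
    StructField("record_id", StringType()),
    StructField("updated_at", TimestampType()),
])

# Requires the com.microsoft.azure:azure-eventhubs-spark connector on the classpath.
connection_string = "<Event Hubs connection string>"
eh_conf = {
    "eventhubs.connectionString":
        spark.sparkContext._jvm.org.apache.spark.eventhubs.EventHubsUtils.encrypt(connection_string),
}

raw = spark.readStream.format("eventhubs").options(**eh_conf).load()

# The connector delivers the payload as binary in the `body` column.
records = (
    raw.select(from_json(col("body").cast("string"), schema).alias("r"))
       .select("r.*")
)
```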

What are my options for a suitable storage backend?

Is a Delta table suitable for this kind of load? Should I use a NoSQL backend (e.g. MongoDB) instead?
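
To make the question concrete, here is the kind of load I would put on a Delta table, continuing the sketch above: upsert each micro-batch so the table always holds the latest version of every record. Again, `record_id`, `updated_at`, and the paths are placeholders, and the target Delta table is assumed to already exist:

```python
from delta.tables import DeltaTable
from pyspark.sql.functions import col, row_number
from pyspark.sql.window import Window

target_path = "/mnt/datalake/legacy_view"  # placeholder

def upsert_batch(batch_df, batch_id):
    # Keep only the newest change per key within this micro-batch.
    w = Window.partitionBy("record_id").orderBy(col("updated_at").desc())
    latest = (batch_df.withColumn("rn", row_number().over(w))
                      .filter(col("rn") == 1)
                      .drop("rn"))

    # Upsert into the target table, ignoring out-of-order older versions.
    target = DeltaTable.forPath(spark, target_path)
    (target.alias("t")
           .merge(latest.alias("s"), "t.record_id = s.record_id")
           .whenMatchedUpdateAll(condition="s.updated_at >= t.updated_at")
           .whenNotMatchedInsertAll()
           .execute())

(records.writeStream
        .foreachBatch(upsert_batch)
        .option("checkpointLocation", "/mnt/checkpoints/legacy_view")  # placeholder
        .start())
```

The daily full dump would flow through the same MERGE, so any records missed from the change stream would converge on the next dump.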

Igor Gatis
  • this is a broad question... It's all about functional & non-functional requirements - how many jobs will use that data, how much you are ready to pay for hosting the data, maintaining databases, etc. – Alex Ott Feb 23 '21 at 16:31
  • It would be nice to hear about different scenarios and how each of these parameters affects the outcome. Especially useful for people like me with little experience. – Igor Gatis Feb 23 '21 at 22:38

0 Answers