I need to sync customer data from several on-premises databases into the cloud. In a second step, the customer data there needs some cleanup in order to remove duplicates (of different types). Based on that cleansed data, I need to do some data analytics.

To achieve this goal, I'm searching for an open-source framework or cloud solution I can use for this. I took a look at Apache Apex and Apache Kafka, but I'm not sure whether these are the right solutions.

Can you give me a hint as to which frameworks you would use for such a task?

Andreas Hof

3 Answers


From my quick read on Apex, it requires Hadoop underneath, which couples you to more dependencies than you probably want early on.

Kafka, on the other hand, is used for transmitting messages (it also has other APIs, such as Streams and Connect, which I'm not as familiar with).

I'm currently using Kafka to stream log files in real time from a client system. Out of the box, Kafka really only provides fire-and-forget semantics; I have had to add a bit on top to get exactly-once delivery semantics (Kafka 0.11.0 should solve this).
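
For illustration, here's a minimal sketch of that setup using the Java producer API. The broker address and topic name are assumptions of mine; the `enable.idempotence` setting is the Kafka 0.11.0 feature mentioned above, which removes retry-induced duplicates on the producer side:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;

public class LogLineProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092"); // assumed broker address
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
                  "org.apache.kafka.common.serialization.StringSerializer");
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
                  "org.apache.kafka.common.serialization.StringSerializer");
        // Kafka 0.11.0+: the idempotent producer deduplicates retries,
        // moving the default fire-and-forget/at-least-once behaviour toward exactly-once.
        props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, "true");
        props.put(ProducerConfig.ACKS_CONFIG, "all");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Topic and key are assumptions for the sketch.
            producer.send(new ProducerRecord<>("customer-log-lines", "host-1", "a log line"));
        } // close() flushes any buffered records
    }
}
```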

Overall, think of Kafka as the more low-level solution, with logical message domains organized as queues, and (from what I skimmed) Apex as a more heavily packaged library with a lot more things to explore.

Kafka would also allow you to swap in the underlying analytical system of your choosing via its Consumer API.
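
That swap works because each downstream system just reads the same topic with its own consumer group. A minimal sketch with a recent Java client (broker, topic, and group names are again my assumptions):

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class AnalyticsConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092"); // assumed broker address
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "analytics-loader"); // each downstream system gets its own group
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG,
                  "org.apache.kafka.common.serialization.StringDeserializer");
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG,
                  "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("customer-log-lines"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    // Hand off to whichever analytical sink you chose.
                    System.out.printf("%s -> %s%n", record.key(), record.value());
                }
            }
        }
    }
}
```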

PragmaticProgrammer
  • 1,079
  • 13
  • 19
1

The question is very generic, but I'll try to outline a few different scenarios, as there are many parameters in play here. One of them is cost, which in the cloud can quickly build up. Of course, the size of the data is also important.

These are a few things you should consider:

  1. batch vs. streaming: do the updates flow continuously, or is the process run on demand/periodically? (It sounds like the latter rather than the former.)
  2. what's the required latency? That is, what's the maximum time an update may take to propagate through the system? The answer to this influences point 1.
  3. how much data are we talking about? Are you in the gigabyte, terabyte, or petabyte range? Different tools have different 'maximum altitudes'.
  4. and in what format? Do you have text files, or are you pulling from relational DBs?
  5. Cleaning and deduping can be tricky in plain SQL. What language/tools are you planning to use for that part? Depending on the data size (point 3), deduping usually requires a join by ID, which is done in constant time per record in a key-value store but requires a sort (generally O(n log n)) in most other data systems (Spark, Hadoop, etc.); see the sketch just after this list.
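
To make point 5 concrete, here's a minimal Java sketch of the key-value approach: one amortized constant-time map lookup per record gives O(n) dedup overall, versus the O(n log n) sort a distributed engine typically performs. The `Customer` fields and the keep-the-first merge rule are assumptions of mine; real duplicates "of different types" need real matching logic.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class DedupSketch {
    // Hypothetical record type; the field names are assumptions (requires Java 16+).
    record Customer(String id, String name, String email) {}

    // Key-value approach: one amortized O(1) map lookup per record, O(n) overall.
    // A sort-based engine would instead shuffle/sort by id, costing O(n log n).
    static List<Customer> dedupById(List<Customer> input) {
        Map<String, Customer> byId = new HashMap<>();
        for (Customer c : input) {
            byId.putIfAbsent(c.id(), c); // merge rule: keep the first record seen
        }
        return new ArrayList<>(byId.values());
    }
}
```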

So, while you ponder all these questions, if you're not sure, I'd recommend starting your cloud work with an elastic, pay-as-you-go solution rather than setting up entire clusters in the cloud, which could quickly become expensive.

One cloud solution that you could quickly fire up is Amazon Athena (https://aws.amazon.com/athena/). You can dump your data into S3, where it's read by Athena, and you just pay per query, so you don't pay when you're not using it. It is based on Presto, so you could write the whole system in basically SQL.
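
As a sketch of how that could look, here's a dedup query submitted through the AWS SDK for Java (aws-java-sdk-athena). The table, column names, and S3 bucket are assumptions; the SQL keeps one row per customer id:

```java
import com.amazonaws.services.athena.AmazonAthena;
import com.amazonaws.services.athena.AmazonAthenaClientBuilder;
import com.amazonaws.services.athena.model.ResultConfiguration;
import com.amazonaws.services.athena.model.StartQueryExecutionRequest;
import com.amazonaws.services.athena.model.StartQueryExecutionResult;

public class AthenaDedupQuery {
    public static void main(String[] args) {
        // Dedup in plain SQL: keep the latest row per id. Table and columns are assumed.
        String query =
            "SELECT id, name, email FROM (" +
            "SELECT *, row_number() OVER (PARTITION BY id ORDER BY updated_at DESC) AS rn " +
            "FROM customers_raw) t " +
            "WHERE rn = 1";

        AmazonAthena athena = AmazonAthenaClientBuilder.defaultClient();
        StartQueryExecutionRequest request = new StartQueryExecutionRequest()
            .withQueryString(query)
            .withResultConfiguration(new ResultConfiguration()
                .withOutputLocation("s3://your-bucket/athena-results/")); // assumed bucket
        StartQueryExecutionResult result = athena.startQueryExecution(request);
        System.out.println("Query id: " + result.getQueryExecutionId());
    }
}
```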

Otherwise you could use Elastic MapReduce with Hive (http://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-hive.html) or Spark (http://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-spark.html). It depends on what language/technology you're most comfortable with. There are also similar products from Google (BigQuery, etc.) and Microsoft (Azure).
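
If you went the Spark-on-EMR route, the dedup step itself is one call. A small sketch in Java (the S3 paths and the id column are assumptions):

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class SparkDedup {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
            .appName("customer-dedup")
            .getOrCreate();

        // Assumed input location; on EMR this would typically be S3.
        Dataset<Row> customers = spark.read().parquet("s3://your-bucket/customers_raw/");

        // dropDuplicates performs the shuffle/sort mentioned above for you.
        Dataset<Row> unique = customers.dropDuplicates("id");

        unique.write().parquet("s3://your-bucket/customers_clean/");
        spark.stop();
    }
}
```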

Roberto Congiu

Yes, you can use Apache Apex for your use case. Apache Apex is complemented by Apache Malhar, an operator library that helps you build applications quickly: you can load the data using the JDBC input operator and then either store it to your cloud storage (maybe S3) or deduplicate it before writing it to any sink. Malhar also provides a Dedup operator for exactly this kind of operation. But as mentioned in the previous answer, Apex does need Hadoop underneath to function.
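
As a rough sketch of what such an Apex application could look like, here's a DAG wiring Malhar's JDBC input operator into its time-based dedup operator. The operator class names and port names are taken from Malhar as I recall them and should be checked against its docs; the JDBC connection settings and dedup key/window live in the Apex configuration and are omitted here:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.apex.malhar.lib.dedup.TimeBasedDedupOperator;
import com.datatorrent.api.DAG;
import com.datatorrent.api.StreamingApplication;
import com.datatorrent.api.annotation.ApplicationAnnotation;
import com.datatorrent.lib.db.jdbc.JdbcPOJOInputOperator;
import com.datatorrent.lib.io.ConsoleOutputOperator;

@ApplicationAnnotation(name = "CustomerSync")
public class CustomerSyncApp implements StreamingApplication {
    @Override
    public void populateDAG(DAG dag, Configuration conf) {
        // Reads rows from the on-premises database; connection details
        // come from the Apex configuration, not shown here.
        JdbcPOJOInputOperator input = dag.addOperator("jdbcInput", new JdbcPOJOInputOperator());

        // Drops tuples whose key was already seen within a configured time window.
        TimeBasedDedupOperator dedup = dag.addOperator("dedup", new TimeBasedDedupOperator());

        // Stand-in sink; a real application would write to S3 or another store.
        ConsoleOutputOperator sink = dag.addOperator("sink", new ConsoleOutputOperator());

        dag.addStream("rawCustomers", input.outputPort, dedup.input);
        dag.addStream("uniqueCustomers", dedup.unique, sink.input);
    }
}
```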

Vikram Patil