3

Hadoop MapReduce and its ecosystem (Hive, etc.) are usually used for batch processing. Is there any way to use Hadoop MapReduce for real-time data processing, for example live results or live tweets?

If not, what are the alternatives for real-time data processing or analysis?

Ketan

3 Answers

4

Real-time App with Map-Reduce

Let’s try to implement a real-time app using Hadoop. To understand the scenario, consider a temperature sensor. As long as the sensor keeps working, we keep getting new readings, so the data never stops.

We should not wait for the data to finish, because it never will. Instead, we could run the analysis periodically (e.g. every hour): run a batch job (e.g. Spark) every hour and process the last hour of data.

What if, every hour, we need an analysis of the last 24 hours? Should we reprocess the last 24 hours of data every hour? Maybe we can compute the hourly results, store them, and derive the 24-hour results from them. That works, but we have to write the code to do it ourselves.
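The "store hourly results and combine them" idea can be sketched as follows. This is a minimal illustration in plain Python, not Hadoop or Spark code; `HourlySummary`, `summarize_hour`, and `RollingDay` are hypothetical names, and the sketch assumes every hour contains at least one reading.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class HourlySummary:
    """Small, mergeable aggregate for one hour of readings (hypothetical)."""
    count: int
    total: float
    maximum: float


def summarize_hour(readings):
    """Reduce one hour of raw readings to a summary.

    Assumes `readings` is non-empty.
    """
    return HourlySummary(len(readings), sum(readings), max(readings))


class RollingDay:
    """Keeps the latest N hourly summaries and merges them on demand."""

    def __init__(self, hours=24):
        # A bounded deque discards the oldest hour automatically.
        self._summaries = deque(maxlen=hours)

    def add_hour(self, readings):
        self._summaries.append(summarize_hour(readings))

    def average(self):
        # Merge stored summaries instead of re-reading the raw data.
        count = sum(s.count for s in self._summaries)
        total = sum(s.total for s in self._summaries)
        return total / count

    def maximum(self):
        return max(s.maximum for s in self._summaries)
```

Each hourly run then only touches one hour of raw data, and the 24-hour figures come from at most 24 small summaries. Storing count and total rather than the hourly average is what makes the summaries mergeable.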

Our problems have just begun. Let us list a few requirements that complicate the problem.

  • What if the temperature sensor is placed inside a nuclear plant and our code raises alarms? Raising an alarm after an hour has elapsed may not be the best way to handle it. Can we get alerts within 1 second?
  • What if you want the readings calculated at the hour boundary, while the data takes a few seconds to arrive at the storage? Now you cannot start the job exactly at the boundary; you need to watch the disk and trigger the job once the data for that hour has arrived.
  • Suppose you make Hadoop run fast. Will the job finish within 1 second? Can we write the data to disk, read it back, process it, produce the results, and recombine them with the other 23 hours of data, all in one second? Now things start to get tight.

The reason you start to feel the friction is that you are not using the right tool for the job. You are using a flat screwdriver on an Allen-wrench screw.

Stream Processing

The right tool for this kind of problem is called “Stream Processing”. Here “stream” refers to the data stream: a sequence of data that keeps arriving. Stream processing can watch the data as it comes in, process it, and respond to it in milliseconds.
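To make the contrast with a batch job concrete, here is a minimal sketch in plain Python of handling each reading as it arrives. It is not tied to any streaming framework; `alerts`, `sensor`, and the threshold value are assumptions made for illustration, and the finite list stands in for a feed that would normally never end.

```python
def alerts(readings, threshold):
    """Yield an alert for each reading above the threshold, as it arrives.

    `readings` is any iterable of (timestamp, temperature) pairs. Nothing
    is written to disk or collected into a batch first.
    """
    for timestamp, temperature in readings:
        if temperature > threshold:
            yield (timestamp, temperature)


def sensor():
    """Stand-in for a live feed; a real source would never finish."""
    yield from [(0, 71.5), (1, 98.2), (2, 70.9), (3, 101.4)]


for timestamp, temperature in alerts(sensor(), threshold=95.0):
    print(f"ALERT at t={timestamp}: {temperature}")
```

Because the alert is produced inside the loop that receives the reading, the delay is bounded by per-event processing time rather than by a job schedule.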

The following are reasons to move beyond batch processing (Hadoop/Spark), our comfort zone, and consider stream processing.

  • Some data naturally comes as a never-ending stream of events. To do batch processing, you need to store it, cut it off at some point, and process the data. Then you have to do the next batch and worry about aggregating across multiple batches. In contrast, streaming handles never-ending data streams gracefully and naturally. You can define conditions, look at multiple levels of focus (discussed further under windows), and easily look at data from multiple streams simultaneously.
  • With streaming, you can respond to events faster. You can produce a result within milliseconds of receiving an event (update). With batch processing this often takes minutes.
  • Stream processing fits naturally with time series data and with detecting patterns over time. For example, if you are trying to detect the length of a web session in a never-ending stream (an example of detecting a sequence), it is very hard to do with batches, because some sessions will fall across two batches. Stream processing handles this easily. If you take a step back, most continuous data series are time series data. For example, almost all IoT data is time series data. Hence, it makes sense to use a programming model that fits naturally.
  • Batch processing lets the data build up and tries to process it all at once, while stream processing handles data as it comes in and so spreads the processing over time. Hence stream processing can work with a lot less hardware than batch processing.
  • Sometimes the data is so large that it is not even possible to store it. Stream processing lets you handle large, fire-hose-style data and retain only the useful bits.
  • Finally, a lot of streaming data is available (e.g. customer transactions, activities, website visits), and it will grow faster with IoT use cases (all kinds of sensors). Streaming is a much more natural model for thinking about and programming those use cases.
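The web-session point can be illustrated with a small sketch. It is plain Python, not a real stream processor; `sessionize` and the `gap` parameter are hypothetical, and events are assumed to arrive in time order. Because state is kept per user across events, a session that would straddle two batches is still reported as one session.

```python
def sessionize(events, gap):
    """Yield (user, start, end) sessions from time-ordered (user, ts) events.

    A user's session closes when that user's next event arrives more than
    `gap` time units after the previous one.
    """
    open_sessions = {}  # user -> [start, last_seen]
    for user, ts in events:
        current = open_sessions.get(user)
        if current is not None and ts - current[1] > gap:
            # The user was idle too long: emit the finished session.
            yield (user, current[0], current[1])
            current = None
        if current is None:
            open_sessions[user] = [ts, ts]
        else:
            current[1] = ts
    # Only reached for finite input; an endless stream would need a timer
    # to close idle sessions instead.
    for user, (start, last) in open_sessions.items():
        yield (user, start, last)
```

For example, with `gap=30` the events `[("a", 0), ("a", 10), ("a", 100)]` produce one session from 0 to 10 and another starting at 100, regardless of where an hourly batch boundary would have fallen.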
Ajay Kharade
  • I see uncanny similarity between your answer and this post https://medium.com/stream-processing/what-is-stream-processing-1eadfca11b97!!! – KGhatak Oct 12 '20 at 21:04
1

In HDP 3.1, Hive-Kafka integration was introduced for working with real-time data. For more info, see the docs: Apache Hive-Kafka Integration

You can add Apache Druid to a Hadoop cluster to process OLAP queries on event data, and you can use Hive and Kafka with Druid.

catpaws
1

Hadoop/Spark shines at batch processing of large volumes of data, but when your use case revolves around real-time analytics, Kafka Streams and Druid are good options to consider.

Here's a good reference for understanding a similar use case: https://www.youtube.com/watch?v=3NEQV5mjKfY

Hortonworks also provides HDF Stack (https://hortonworks.com/products/data-platforms/hdf/) which works best with use cases related to data in motion.

The Kafka and Druid documentation is a good place to understand the strengths of both technologies:

Kafka: https://kafka.apache.org/documentation/streams/
Druid: http://druid.io/docs/latest/design/index.html#when-to-use-druid

Jainik