6

I have a bit of a problem. I want to learn about Hadoop and how I might use it to handle data streams in real time. To that end, I want to build a meaningful POC around it so that I can showcase it when I need to demonstrate my knowledge of it to a potential employer, or to introduce it at my present firm.

I'd also like to mention that I am limited in hardware resources: just my laptop and me :) I know the basics of Hadoop and have written 2-3 basic MR jobs, but I want to do something more meaningful and real-world.

Please suggest.

Thanks in advance.

Kumar Vaibhav
  • 2,632
  • 8
  • 32
  • 54
  • If you want to apply and show me that you use (Hadoop) MR for realtime analysis, I would immediately kick your application into the trash bin. – Thomas Jungblut Jan 12 '13 at 17:23
  • If you want to do something meaningful in realtime then use `Storm`, `Gridgain` or `Impala`. – Thomas Jungblut Jan 12 '13 at 17:29
  • You can consider spinning up VMs in the cloud if you are limited in hardware resources. And one more idea: if you just want MapReduce, you can try Amazon's Elastic MapReduce. – paras_doshi Jan 14 '13 at 03:08

8 Answers

10

I'd like to point out a few things.

If you want to do a POC with just 1 laptop, there's little point in using Hadoop.

Also, as others have said, Hadoop is not designed for realtime applications, because there is some overhead in running Map/Reduce jobs.

That being said, Cloudera released Impala, which works with the Hadoop ecosystem (specifically the Hive metastore) to achieve realtime performance. Be aware that to achieve this it does not generate Map/Reduce jobs, and it is currently in beta, so use it carefully.
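As an aside, here is a minimal, hypothetical sketch of querying Impala over JDBC from Scala. The driver class, the port (21050 is Impala's usual JDBC port), and the `web_logs` table are assumptions you'd need to check against your own setup and the Impala documentation.

```scala
// Hypothetical Impala query via the Hive JDBC driver; table and columns are made up.
import java.sql.DriverManager

object ImpalaTopPages {
  def main(args: Array[String]): Unit = {
    // Assumes the Hive/Impala JDBC driver jar is on the classpath.
    Class.forName("org.apache.hive.jdbc.HiveDriver")
    val conn = DriverManager.getConnection("jdbc:hive2://localhost:21050/;auth=noSasl")
    try {
      val rs = conn.createStatement().executeQuery(
        "SELECT page, COUNT(*) AS hits FROM web_logs GROUP BY page ORDER BY hits DESC LIMIT 10")
      while (rs.next()) println(s"${rs.getString("page")}\t${rs.getLong("hits")}")
    } finally conn.close()
  }
}
```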

So I would really advise looking at Impala so you can stay within the Hadoop ecosystem, but if you're also considering alternatives, here are a few other frameworks that could be of use:

  • Druid : was open-sourced by MetaMarkets. Looks interesting, even though I've not used it myself.
  • Storm : no integration with HDFS, it just processes data as it comes.
  • HStreaming : integrates with Hadoop.
  • Yahoo S4 : seems pretty close to Storm.

In the end, I think you should really analyze your requirements and see whether Hadoop is what you need, because it's only getting started in the realtime space. There are several other projects which could help you achieve realtime performance.


If you want ideas for projects to showcase, I suggest looking at this link. Here are some examples:

  • Finance/Insurance
    • Classify investment opportunities as good or bad, e.g. based on industry/company metrics, portfolio diversity and currency risk.
    • Classify credit card transactions as valid or invalid based on, e.g., the location of the transaction and the card holder, date, amount, purchased item or service, transaction history and similar transactions.
  • Biology/Medicine
    • Classification of proteins into structural or functional classes
    • Diagnostic classification, e.g. cancer tumours based on images
  • Internet
    • Document Classification and Ranking
    • Malware classification, email/tweet/web spam classification
  • Production Systems (e.g. in energy or petrochemical industries)
    • Classify and detect situations (e.g. sweet spots or risk situations) based on realtime and historic data from sensors
Charles Menguy
  • 40,830
  • 17
  • 95
  • 117
3

If you want to get your hands dirty with a highly promising streaming framework, try BDAS Spark Streaming. Caution: this is not yet released, but you can play around on your laptop with the GitHub version (https://github.com/mesos/spark/tree/streaming). There are many samples to get you started.

It also has many advantages over existing frameworks:

1. It lets you combine real-time and batch computation in one stack.
2. It gives you a REPL where you can try your ad hoc queries interactively.
3. You can run it on your laptop in local mode.

There are many other advantages, but these three, I believe, will be enough to get you started.
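To give you a feel for it, here is a minimal sketch of a Spark Streaming word count in Scala, written against the current API (the 0.7 alpha mentioned above differs slightly). The socket host/port are placeholders; you could feed it locally with `nc -lk 9999`.

```scala
// Minimal Spark Streaming word count, runnable in local mode on a laptop.
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingWordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("StreamingWordCount")
    val ssc  = new StreamingContext(conf, Seconds(5))     // 5-second micro-batches

    val lines  = ssc.socketTextStream("localhost", 9999)  // placeholder text source
    val counts = lines.flatMap(_.split("\\s+"))
                      .map(word => (word, 1))
                      .reduceByKey(_ + _)
    counts.print()                                        // print each batch's counts

    ssc.start()
    ssc.awaitTermination()
  }
}
```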

You might have to learn Scala to try out the REPL :-(

For more information, check out http://spark-project.org/

user644745
  • 5,673
  • 9
  • 54
  • 80
  • The Spark Streaming Alpha was released on Feb 27, 2013 as part of [Spark 0.7](http://spark-project.org/spark-release-0-7-0/). If you're interested in learning more about Spark Streaming, check out the [Streaming Programming Guide](http://spark-project.org/docs/latest/streaming-programming-guide.html) and the [talk on Spark Streaming](https://www.youtube.com/watch?v=mKdm4NCtYgk) from the Spark Users Meetup group. The Berkeley AMPLab's free [Big Data Mini Course](http://ampcamp.berkeley.edu/big-data-mini-course-home/) also has an exercise that processes a live stream from Twitter. – Josh Rosen Apr 21 '13 at 00:33
1

Hadoop is a high-throughput framework suited to batch processing. If you are interested in processing and analyzing huge data sets in real time, please look into Twitter Storm.
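To make that concrete, here is a minimal, hypothetical Storm topology in Scala, written against the 2013-era `backtype.storm` Java API. The spout, bolt and names are made up for illustration; it runs entirely in a local in-process cluster, which is enough for a laptop POC.

```scala
import java.util.{Map => JMap}
import scala.util.Random

import backtype.storm.{Config, LocalCluster}
import backtype.storm.spout.SpoutOutputCollector
import backtype.storm.task.TopologyContext
import backtype.storm.topology.{BasicOutputCollector, OutputFieldsDeclarer, TopologyBuilder}
import backtype.storm.topology.base.{BaseBasicBolt, BaseRichSpout}
import backtype.storm.tuple.{Fields, Tuple, Values}

// Spout that emits a random word every 100 ms, standing in for a real stream.
class RandomWordSpout extends BaseRichSpout {
  private var collector: SpoutOutputCollector = _
  private val words = Array("hadoop", "storm", "stream", "batch")

  override def open(conf: JMap[_, _], context: TopologyContext,
                    out: SpoutOutputCollector): Unit = { collector = out }

  override def nextTuple(): Unit = {
    Thread.sleep(100)
    collector.emit(new Values(words(Random.nextInt(words.length))))
  }

  override def declareOutputFields(declarer: OutputFieldsDeclarer): Unit =
    declarer.declare(new Fields("word"))
}

// Bolt that simply prints each word as it arrives.
class PrinterBolt extends BaseBasicBolt {
  override def execute(tuple: Tuple, collector: BasicOutputCollector): Unit =
    println(tuple.getStringByField("word"))

  override def declareOutputFields(declarer: OutputFieldsDeclarer): Unit = ()
}

object WordTopology {
  def main(args: Array[String]): Unit = {
    val builder = new TopologyBuilder
    builder.setSpout("words", new RandomWordSpout, 1)
    builder.setBolt("printer", new PrinterBolt, 1).shuffleGrouping("words")

    // LocalCluster runs the whole topology inside a single JVM.
    new LocalCluster().submitTopology("word-demo", new Config, builder.createTopology())
  }
}
```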

debarshi
  • 327
  • 1
  • 11
1

One classic problem that I'm sure is more realtime than anything else: option trading. The key here is to watch for news and trends on Twitter, Facebook and YouTube, and then identify candidates for possible PUT or CALL options. You will need good skills and an elaborate implementation of Mahout with Nutch/Lucene, then use the trending data to understand the current situation so that the system can recommend bets (options).

javadevg
  • 644
  • 1
  • 7
  • 13
0

I'm clearly biased, but I would also recommend looking at GridGain for anything real-time. GridGain is an in-memory data platform that provides an ACID NoSQL datastore and fast in-memory MapReduce.

Nikita Ivanov
  • 406
  • 3
  • 5
0

I think you could have a POC running, for example, an online/recursive regression algorithm in MapReduce. But remember that this will only prove that your "learning rule" works. Maybe (I've never tried this) you can use the results in real time by telling your reducers to write them to a temporary file that can be read by another thread.
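To make the "learning rule" part concrete, here is a plain-Scala sketch of an online (stochastic gradient descent) update for linear regression. The toy data and learning rate are made up; wiring this into mappers/reducers is left open.

```scala
// Online linear regression via stochastic gradient descent: one update per example.
object OnlineRegression {
  def predict(w: Array[Double], x: Array[Double]): Double =
    (w zip x).map { case (wi, xi) => wi * xi }.sum

  // A single SGD step: w <- w + lr * (y - w.x) * x
  def update(w: Array[Double], x: Array[Double], y: Double, lr: Double): Array[Double] = {
    val error = y - predict(w, x)
    (w zip x).map { case (wi, xi) => wi + lr * error * xi }
  }

  def main(args: Array[String]): Unit = {
    // Toy "stream" generated from y = 2*x1 + 3*x2, arriving one example at a time.
    val stream = Seq(
      (Array(1.0, 1.0), 5.0),
      (Array(2.0, 0.5), 5.5),
      (Array(0.0, 3.0), 9.0),
      (Array(1.5, 2.0), 9.0))

    var w = Array(0.0, 0.0)
    for (_ <- 1 to 200; (x, y) <- stream)   // several passes so the weights settle
      w = update(w, x, y, lr = 0.05)

    println(w.mkString("weights: [", ", ", "]"))  // should approach [2.0, 3.0]
  }
}
```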

Also, Mahout allows you to split your data set across several different SequenceFiles. You may use this to simulate an online stream and classify/cluster your data set "online". You can even copy part of the data into the folder with the other data before the algorithm starts to run. Mahout in Action details how to do that.
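If you go that route, here is a hedged sketch of writing a Hadoop SequenceFile from Scala, which you could use to split a data set into the separate files described above; the output path and toy records are placeholders.

```scala
// Write a few text records into a SequenceFile (LongWritable key, Text value).
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.hadoop.io.{LongWritable, SequenceFile, Text}

object WriteSequenceFile {
  def main(args: Array[String]): Unit = {
    val conf = new Configuration()
    val fs   = FileSystem.get(conf)              // local FS unless HDFS is configured
    val path = new Path("/tmp/stream/part-00000")

    val writer = SequenceFile.createWriter(
      fs, conf, path, classOf[LongWritable], classOf[Text])
    try {
      val records = Seq("first record", "second record", "third record")
      records.zipWithIndex.foreach { case (line, i) =>
        writer.append(new LongWritable(i.toLong), new Text(line))
      }
    } finally writer.close()
  }
}
```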

See if one of the following datasets is to your taste: http://archive.ics.uci.edu/ml/datasets.html

ASGM
  • 11,051
  • 1
  • 32
  • 53
Eder Santana
  • 439
  • 1
  • 4
  • 8
0

If you want to build a real-time application, then I suggest you use the Apache Spark framework, which is designed for real-time processing and also supports a polyglot API (Scala, Java, Python, R).

code_cody97
  • 100
  • 9
-1

I was looking for something like this -

https://www.kaggle.com/competitions

These are well-defined problems, many of them Big Data problems, and some of them require real-time processing.

But thanks to all who answered.

Kumar Vaibhav
  • 2,632
  • 8
  • 32
  • 54
  • Kaggle is a nice resource! It's in the area of "Predictive Analytics" and not all of its problems are necessarily a good fit for MapReduce/Hadoop/Big Data. But I believe Mahout (part of the Hadoop ecosystem) would be something I would try for relevant competitions. – paras_doshi Jan 14 '13 at 03:11
  • Agreed. Then what do you suggest? – Kumar Vaibhav Jan 14 '13 at 04:47
  • I had a related query. I wanted to create a meaningful POC too; I was searching for one and landed on this SO thread. Let's hope someone points us to some resources. Meanwhile, I have personally finished playing with the samples here: http://gettingstarted.hadooponazure.com/ – paras_doshi Jan 14 '13 at 18:26