I am planning a MapReduce project that uses the Hadoop libraries and will be tested on a large data set uploaded to AWS. I have not finalized an idea yet, but I am sure it will involve some kind of data processing, MapReduce design patterns, and possibly graph algorithms, Hive, and Pig Latin. I would really appreciate it if someone could give me some ideas. I have a few of my own in mind.

In the end I have to work on a large data set, extract some information from it, and draw conclusions. I have used Weka for data mining before (using trees).

But I am not sure whether Weka is the only thing I can work with right now. Are there other ways to work with a large data set and draw conclusions from it?

Also, how can I involve graphs in this?

Basically, I want to do a research project, but I am not sure what exactly I should work on or what it should look like. Any thoughts, links, or ideas?

  • Such a question is not appropriate for Stack Overflow. Plus, it has been asked numerous times before, e.g. http://stackoverflow.com/questions/3953787/using-hadoop-map-reduce-for-programming-language-design-course-project http://stackoverflow.com/questions/4894396/hadoop-machine-learning-data-mining-project-idea http://stackoverflow.com/questions/1375102/data-mining-project-ideas?rq=1 – Has QUIT--Anony-Mousse Nov 13 '12 at 09:04

2 Answers

I suggest you check out Apache Mahout. It is a scalable machine learning and data mining framework that should integrate nicely with Hadoop.

Hive gives you a SQL-like language for querying big data. Essentially, it translates your high-level query into MapReduce jobs and runs them on the cluster.
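To make the "query becomes MapReduce jobs" idea concrete, here is a minimal pure-Python sketch of the three phases behind a query of the form `SELECT page, COUNT(*) ... GROUP BY page`. It is only an in-memory illustration, not how Hive actually plans or executes queries; the function names and the `visits` sample data are made up for this example.

```python
from collections import defaultdict


def map_phase(records):
    """Emit a (page, 1) pair for each (user, page) record, as a mapper would."""
    for _user, page in records:
        yield page, 1


def shuffle(pairs):
    """Group values by key, which the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Sum the values collected for each key, as a reducer would."""
    return {key: sum(values) for key, values in groups.items()}


def count_by_page(records):
    """Run the three phases in sequence on an in-memory list of records."""
    return reduce_phase(shuffle(map_phase(records)))


# Hypothetical sample data: (user, page) visit records.
visits = [("alice", "/home"), ("bob", "/home"), ("alice", "/about")]
print(count_by_page(visits))
```

On a real cluster the map and reduce steps run in parallel on different machines and the shuffle moves data over the network, but the shape of the computation is the same.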

Another suggestion is to consider writing your data processing algorithm in R, a statistical software environment (similar to MATLAB). Instead of the standard R environment, I would recommend Revolution R, which is an environment for developing in R with much more powerful tools for big data and clustering.

Edit: If you are a student, Revolution R has a free academic edition.

Edit: A third suggestion is to look at GridGain, another Map/Reduce implementation in Java that is relatively easy to run on a cluster.

iTech

As you are already working with MapReduce and Hadoop, you can extract some knowledge from your data using Mahout, or you can get some ideas from this very good book:

http://infolab.stanford.edu/~ullman/mmds.html

This book provides ideas for mining social-network graphs, and it works with graphs in a couple of other ways too.
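As a small taste of how a graph computation fits the map/reduce pattern, here is a minimal sketch that computes the degree of every node from an edge list. It is an in-memory illustration written for this answer, not code from the book; the function names and the `edges` sample are hypothetical, and the graph is assumed to be undirected.

```python
from collections import defaultdict


def map_edge(edge):
    """Map step: each undirected edge contributes 1 to both of its endpoints."""
    u, v = edge
    yield u, 1
    yield v, 1


def node_degrees(edges):
    """Group the emitted pairs by node, then sum them (the reduce step)."""
    grouped = defaultdict(list)
    for edge in edges:
        for node, one in map_edge(edge):
            grouped[node].append(one)
    return {node: sum(ones) for node, ones in grouped.items()}


# Hypothetical sample graph given as a list of undirected edges.
edges = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "d")]
print(node_degrees(edges))
```

Node degrees are a common starting point for more involved graph analyses, so this could be a reasonable first job to run on a large edge list.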

Hope it helps!

Renata Ghisloti