I have done some join operations, both map-side and reduce-side, on a small dataset. I am looking for a publicly available, gigabyte-scale dataset to measure performance on a cluster. Do you know of any dataset suitable for a many-to-many join?

Yeameen
  • Here is a similar question: http://stackoverflow.com/questions/10843892/download-large-data-for-hadoop. I guess you will find something appropriate there. Good luck. – www Feb 27 '13 at 21:47
  • Thanks @WawrzyniecSz.! I am currently looking into those, but have yet to find a dataset with multiple files that I can use for joins in Hadoop. – Yeameen Feb 27 '13 at 23:47
  • You could add a dummy field such as rand(1, row_number/1M) to any of those datasets in a map-only job. Copy the result, and two datasets with a many-to-many relation are ready! – www Feb 28 '13 at 13:33
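
The dummy-field idea from the last comment could be sketched roughly as follows. This is only an illustration in Python, not the commenter's code: the names `add_join_key`, `num_keys`, and `seed` are hypothetical, and it assumes that `rand(1, row_number/1M)` means "a random integer between 1 and the row count divided by one million", so that each key repeats many times in each dataset.

```python
import random


def add_join_key(lines, num_keys, seed=None):
    """Prefix each record with a random integer join key in [1, num_keys].

    Hypothetical helper: num_keys plays the role of row_number/1M from
    the comment, so a smaller value means more rows share each key.
    """
    rng = random.Random(seed)
    for line in lines:
        # Strip the trailing newline and emit "key<TAB>original record".
        record = line.rstrip("\n")
        yield f"{rng.randint(1, num_keys)}\t{record}"


# Two toy "datasets" built from arbitrary records. Because keys repeat
# on both sides, joining on the first field is many-to-many.
left = list(add_join_key(["a", "b", "c", "d"], num_keys=2, seed=1))
right = list(add_join_key(["x", "y", "z"], num_keys=2, seed=2))
```

On a cluster, the same per-line logic could presumably run as a Hadoop Streaming mapper with no reducers, reading records from standard input and writing the keyed records to standard output. Running it twice with different seeds (or over two different source datasets) would give the two sides of the join.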

0 Answers