I have implemented join operations, both map-side and reduce-side, on a small dataset. I am now looking for a publicly available, gigabyte-scale dataset to measure performance on a cluster. Does anyone know of a dataset suitable for a many-to-many join?
-
Here is a similar question: http://stackoverflow.com/questions/10843892/download-large-data-for-hadoop. I guess you will find something appropriate there. GL – www Feb 27 '13 at 21:47
-
Thanks @WawrzyniecSz.! I am currently looking into those, but I have yet to find a dataset with multiple files that I can join using Hadoop. – Yeameen Feb 27 '13 at 23:47
-
You could add a dummy field such as rand(1, row_number/1M) to any of those datasets in a map-only job. Copy the result, and two datasets with a many-to-many relation are ready! – www Feb 28 '13 at 13:33
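The dummy-field idea from the last comment could look roughly like the sketch below. It is only an illustration: the function name `add_join_key`, the `rows_per_key` parameter, and the tab-separated output are assumptions made here, not anything from the linked datasets. It also counts rows locally, so in a real map-only job each mapper would number only the rows of its own input split.

```python
import random


def add_join_key(lines, rows_per_key=1_000_000, seed=None):
    """Prefix each input line with a random dummy join key.

    Mirrors rand(1, row_number / 1M): the key for a row is a random
    integer between 1 and row_number // rows_per_key (at least 1).
    Many rows share the same key, so joining two such outputs on the
    key gives a many-to-many relation.
    """
    rng = random.Random(seed)  # seed is optional, only for repeatable runs
    for row_number, line in enumerate(lines, start=1):
        upper = max(1, row_number // rows_per_key)
        key = rng.randint(1, upper)  # inclusive on both ends
        yield f"{key}\t{line.rstrip(chr(10))}"


if __name__ == "__main__":
    import sys

    # Streaming-style use: read records from stdin, write keyed records to stdout.
    for keyed_line in add_join_key(sys.stdin):
        print(keyed_line)
```

Running it twice over the same file (or once, then copying the output) would give the two keyed inputs to join on the first tab-separated field.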