Currently, I have a Pig script running on Amazon EMR that loads a bunch of files from S3, does some filter processing, and groups the data by phone number, so the resulting schema looks like (phonenumber:chararray, bag:{mydata:chararray}).
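For reference, the pipeline looks roughly like this (paths, delimiters, and field names are placeholders, not my real ones):

```pig
-- Load raw records from S3 (path and schema are placeholders)
raw = LOAD 's3://my-input-bucket/data/*' USING PigStorage('\t')
      AS (phonenumber:chararray, mydata:chararray);

-- Stand-in for my real filter processing
filtered = FILTER raw BY phonenumber IS NOT NULL;

-- Group by phone number; each group carries a bag of records
-- for that number, i.e. (group:chararray, filtered:{(phonenumber, mydata)})
grouped_by_phone = GROUP filtered BY phonenumber;
```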
Next, I have to store the data for each phone number into a different S3 bucket (possibly buckets in different accounts that I have access to). org.apache.pig.piggybank.storage.MultiStorage seems like the best fit here, but it doesn't work for me, as I am facing two problems:
- There are a lot of phone numbers (roughly 20,000), so storing each phone number into a different S3 bucket is very slow, and the job even runs out of memory.
- MultiStorage gives me no way to consult my lookup table to decide which bucket each phone number should be stored into.
So I am wondering if anyone can help out? The second problem can probably be solved by writing my own UDF store function, but how do I solve the first one? Thanks.
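If I went the custom-StoreFunc route, I imagine the usage would look something like this. LookupBucketStorage is a hypothetical class I would still have to write (extending org.apache.pig.StoreFunc and resolving each phone number's destination bucket from my lookup table); the jar name, package, alias, and paths are all made up:

```pig
-- Register the jar containing my hypothetical custom StoreFunc
REGISTER my-storefuncs.jar;

-- grouped_by_phone stands for my relation of
-- (phonenumber:chararray, bag:{mydata:chararray})
STORE grouped_by_phone INTO 's3://ignored-base-path'
      USING com.example.LookupBucketStorage('s3://my-config-bucket/lookup.tsv');
```

Even with that, I don't see how it avoids the first problem of opening ~20,000 separate outputs.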