BACKGROUND
I have a binary classification task where the data is highly imbalanced: there are far more examples with label 0 than with label 1. To address this, I plan to subsample the label-0 data so that its size roughly matches the label-1 data. I do this in a Pig script. Instead of sampling only one chunk of training data, I repeat the sampling 10 times to generate 10 data chunks and train 10 classifiers, similar to bagging, to reduce variance.
SAMPLE PIG SCRIPT
---------------------------------
-- generate training chunk i
---------------------------------
-- subsampling data with label 0
labelZeroTrainingDataChunki = SAMPLE labelZeroTrainingData $RATIO;
-- combine data with label 0 and label 1
trainingChunkiRaw = UNION labelZeroTrainingDataChunki, labelOneTrainingData;
-- join two tables to get all the features back from table 'dataFeatures'
trainingChunkiFeatures = JOIN trainingChunkiRaw BY id, dataFeatures BY id;
-- in order to shuffle data, I give a random number to each data
trainingChunki = FOREACH trainingChunkiFeatures GENERATE
    trainingChunkiRaw::id AS id,
    trainingChunkiRaw::label AS label,
    dataFeatures::features AS features,
    RANDOM() AS r;
-- shuffle the data
trainingChunkiShuffledRandom = ORDER trainingChunki BY r;
-- store this chunk of data into s3
trainingChunkiToStore = FOREACH trainingChunkiShuffledRandom GENERATE
    id AS id,
    label AS label,
    features AS features;
STORE trainingChunkiToStore INTO '$training_data_i_s3_path' USING PigStorage(',');
In my real Pig script, I do this 10 times to generate 10 data chunks.
PROBLEM
The problem is that when I generate 10 chunks of data, there are a huge number of mapper/reducer tasks, more than 10K. The majority of the mappers do very little work (each runs for less than 1 minute). And at some point the whole Pig script gets jammed: only one mapper/reducer task can run, and all the other mapper/reducer tasks are blocked.
WHAT I'VE TRIED
To figure out what was happening, I first reduced the number of chunks to 3. The situation was less severe: there were roughly 7 or 8 mappers running at the same time. Again, these mappers did very little work (each ran for about 1 minute).
Then I increased the number of chunks to 5, and at that point I observed the same problem as with 10 chunks: eventually only one mapper or reducer was running and all the other mappers and reducers were blocked.
I then removed part of the script so that it stores only id and label, without the features:
--------------------------------------------------------------------------
-- generate training chunk i
--------------------------------------------------------------------------
-- subsampling data with label 0
labelZeroTrainingDataChunki = SAMPLE labelZeroTrainingData $RATIO;
-- combine data with label 0 and label 1
trainingChunkiRaw = UNION labelZeroTrainingDataChunki, labelOneTrainingData;
STORE trainingChunkiRaw INTO '$training_data_i_s3_path' USING PigStorage(',');
This worked without any problem.
Then I added the shuffling back:
--------------------------------------------------------------------------
-- generate training chunk i
--------------------------------------------------------------------------
-- subsampling data with label 0
labelZeroTrainingDataChunki = SAMPLE labelZeroTrainingData $RATIO;
-- combine data with label 0 and label 1
trainingChunkiRaw = UNION labelZeroTrainingDataChunki, labelOneTrainingData;
trainingChunki = FOREACH trainingChunkiRaw GENERATE
    id,
    label,
    features,
    RANDOM() AS r;
-- shuffle data
trainingChunkiShuffledRandom = ORDER trainingChunki BY r;
trainingChunkiToStore = FOREACH trainingChunkiShuffledRandom GENERATE
    id AS id,
    label AS label,
    features AS features;
STORE trainingChunkiToStore INTO '$training_data_i_s3_path' USING PigStorage(',');
The same problem reappeared. Even worse, at some point there was no mapper/reducer running at all; the whole program hung without making any progress. I added another machine and the program ran for a few minutes before it jammed again. It looks like there are some dependency issues here.
WHAT'S THE PROBLEM
I suspect there is some dependency that leads to a deadlock. The confusing thing is that before the shuffling, I have already generated the data chunks. I was expecting the shuffling to be executed in parallel, since the data chunks are independent of each other.
I also noticed that many mappers/reducers do very little work (they exist for less than 1 minute). In that case, I would imagine the overhead of launching mappers/reducers is high. Is there any way to control this?
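For example, is combining small input splits the right knob to turn? Something like the following is what I have in mind (untested; these property names are my understanding of Pig's split-combination settings, so they may not be the right ones for my case):
-- untested sketch: merge many small input splits so fewer short-lived mappers get launched
SET pig.splitCombination true;
-- target roughly 128 MB per combined split
SET pig.maxCombinedSplitSize 134217728;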
- What's the problem? Any suggestions?
- Is there a standard way to do this kind of sampling? I would imagine there are many cases where such subsampling is needed, e.g. bootstrapping or bagging, so there might be a standard way to do it in Pig, but I couldn't find anything useful online. What I have in mind is something like the macro sketched below. Thanks a lot.
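To make the second question more concrete, here is my own untested sketch of what such a utility could look like as a plain Pig macro (the macro name and parameters are made up, not anything built in):
-- untested sketch of a macro that builds one rebalanced training chunk
DEFINE generateBalancedChunk(zeroData, oneData, ratio) RETURNS chunk {
    -- subsample the majority class (label 0)
    sampledZero = SAMPLE $zeroData $ratio;
    -- combine with the minority class (label 1)
    $chunk = UNION sampledZero, $oneData;
};
-- usage for one chunk, e.g.:
-- trainingChunk1 = generateBalancedChunk(labelZeroTrainingData, labelOneTrainingData, $RATIO);
-- STORE trainingChunk1 INTO '$training_data_1_s3_path' USING PigStorage(',');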
ADDITIONAL INFO
- The table 'labelZeroTrainingData' is really small, around 16 MB gzipped. It is also generated in the same Pig script, by filtering.
- I ran the Pig script on 3 AWS c3.2xlarge machines.
- The table 'dataFeatures' can be large, around 15 GB gzipped.
- I didn't modify any of Hadoop's default configuration.
- I checked the disk space and memory usage: disk usage is around 40% and memory usage is around 90%. I'm not sure memory is the problem, since I was told that if memory were the issue, the whole task should fail.