
BACKGROUND

I have a binary classification task where the data is highly imbalanced: there are far more examples with label 0 than with label 1. To address this, I plan to subsample the data with label 0 to roughly match the size of the data with label 1. I do this in a pig script. Instead of sampling only one chunk of training data, I do it 10 times to generate 10 data chunks and train 10 classifiers, similar to bagging, to reduce variance.

SAMPLE PIG SCRIPT

---------------------------------
-- generate training chunk i
---------------------------------
-- subsampling data with label 0
labelZeroTrainingDataChunki = SAMPLE labelZeroTrainingData $RATIO;

-- combine data with label 0 and label 1
trainingChunkiRaw = UNION labelZeroTrainingDataChunki, labelOneTrainingData;

-- join two tables to get all the features back from table 'dataFeatures'
trainingChunkiFeatures = JOIN trainingChunkiRaw BY id, dataFeatures BY id;
-- to shuffle the data, assign a random number to each record
trainingChunki = FOREACH trainingChunkiFeatures GENERATE
                        trainingChunkiRaw::id AS id,
                        trainingChunkiRaw::label AS label,
                        dataFeatures::features AS features,
                        RANDOM() AS r;
-- shuffle the data
trainingChunkiShuffledRandom = ORDER trainingChunki BY r;

-- store this chunk of data into s3
trainingChunkiToStore = FOREACH trainingChunkiShuffledRandom GENERATE
                        id AS id,
                        label AS label,
                        features AS features;

STORE trainingChunkiToStore INTO '$training_data_i_s3_path' USING PigStorage(',');

In my real pig script, I do this 10 times to generate 10 data chunks.

PROBLEM

The problem is that when I generate 10 chunks of data, there are a huge number of mapper/reducer tasks, more than 10K. The majority of the mappers do very little work (each runs for less than 1 min). At some point the whole pig script gets jammed: only one mapper/reducer task can run and all the other mapper/reducer tasks are blocked.

WHAT I'VE TRIED

  1. To figure out what was happening, I first reduced the number of chunks to 3. The situation was less severe: roughly 7 or 8 mappers ran at the same time. Again, these mappers did very little work (each ran for about 1 min).

  2. Then I increased the number of chunks to 5. At this point I observed the same problem as with 10 chunks: at some point only one mapper or reducer was running and all the other mappers and reducers were blocked.

  3. I removed part of the script so that only id and label are stored, without features:

    --------------------------------------------------------------------------
    -- generate training chunk i
    --------------------------------------------------------------------------
    -- subsampling data with label 0
    labelZeroTrainingDataChunki = SAMPLE labelZeroTrainingData $RATIO;
    
    -- combine data with label 0 and label 1
    trainingChunkiRaw = UNION labelZeroTrainingDataChunki, labelOneTrainingData;
    
    STORE trainingChunkiRaw INTO '$training_data_i_s3_path' USING PigStorage(',');
    

This worked without any problem.

  4. Then I added the shuffling back:

    --------------------------------------------------------------------------
    -- generate training chunk i
    --------------------------------------------------------------------------
    
    -- subsampling data with label 0
    labelZeroTrainingDataChunki = SAMPLE labelZeroTrainingData $RATIO;
    
    -- combine data with label 0 and label 1
    trainingChunkiRaw = UNION labelZeroTrainingDataChunki, labelOneTrainingData;
    trainingChunki = FOREACH trainingChunkiRaw GENERATE
                        id,
                        label,
                        features,
                        RANDOM() AS r;
    -- shuffle data
    trainingChunkiShuffledRandom = ORDER trainingChunki BY r;
    trainingChunkiToStore = FOREACH trainingChunkiShuffledRandom GENERATE
                        id AS id,
                        label AS label,
                        features AS features;
    
    STORE trainingChunkiToStore INTO '$training_data_i_s3_path' USING PigStorage(',');
    

The same problem reappeared. Even worse, at some point there was no mapper/reducer running at all; the whole program hung without making any progress. I added another machine, and the program ran for a few minutes before it jammed again. It looks like there is some dependency issue here.

WHAT'S THE PROBLEM

I suspect there is some dependency that leads to a deadlock. The confusing thing is that the data chunks are already generated before the shuffling, so I was expecting the shuffling to be executed in parallel, since the chunks are independent of each other.

I also noticed that many mappers/reducers do very little work (each exists for less than 1 min). In that case, I would imagine the overhead of launching mappers/reducers is high. Is there any way to control this?

  1. What's the problem? Any suggestions?
  2. Is there a standard way to do this kind of sampling? I would imagine there are many cases that need this kind of subsampling, such as bootstrapping or bagging, so there might be a standard way to do it in pig. I couldn't find anything useful online. Thanks a lot.

ADDITIONAL INFO

  1. The table 'labelZeroTrainingData' is really small, around 16 MB gzipped. It is also generated in the same pig script by filtering.
  2. I ran the pig script on 3 AWS c3.2xlarge machines.
  3. The table 'dataFeatures' can be large, around 15 GB gzipped.
  4. I didn't modify any default configuration of hadoop.
  5. I checked disk space and memory usage: disk usage is around 40% and memory usage is around 90%. I'm not sure memory is the problem, since I was told that if memory were the issue, the whole task would fail.

1 Answer


After a while, I think I have figured out something. The problem is likely the multiple STORE statements. It looks like the pig script runs in batch (multiquery) mode by default, so for each chunk of data there is a job running, which leads to a lack of resources, e.g. mapper and reducer slots. None of the jobs can finish because each one needs more mapper/reducer slots.

SOLUTION

  1. Use piggybank. There is a storage function called MultiStorage that might be useful in this case (see the first sketch after this list). I had a version-incompatibility issue between piggybank and hadoop, but it might work for you.
  2. Disable pig's batch execution. Pig tries to optimize execution across queries; I simply disable this multiquery feature by adding -M. So the invocation looks like pig -M -f pig_script.pig, which executes one statement at a time without that optimization. This might not be ideal because no optimization is done, but for me it's acceptable.
  3. Use EXEC in pig to enforce a certain execution order, which is helpful in this case (see the second sketch after this list).
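
For option 1, here is a minimal sketch of what I had in mind with MultiStorage, assuming piggybank.jar is available and each chunk is tagged with a literal chunk id so that a single STORE can split the output by that field. The relation names, the chunk_id field, and $output_root are placeholders, not from my actual script.

    --------------------------------------------------------------------------
    -- sketch: tag each chunk with a chunk id, then write everything in one STORE
    --------------------------------------------------------------------------
    REGISTER piggybank.jar;  -- adjust the jar path for your cluster

    -- chunk 1: subsample label-0 data, add back label-1 data, tag with 'chunk1'
    labelZeroChunk1 = SAMPLE labelZeroTrainingData $RATIO;
    chunk1Raw = UNION labelZeroChunk1, labelOneTrainingData;
    chunk1 = FOREACH chunk1Raw GENERATE 'chunk1' AS chunk_id, id, label;

    -- chunk 2: same statements with a different tag (repeat for the remaining chunks)
    labelZeroChunk2 = SAMPLE labelZeroTrainingData $RATIO;
    chunk2Raw = UNION labelZeroChunk2, labelOneTrainingData;
    chunk2 = FOREACH chunk2Raw GENERATE 'chunk2' AS chunk_id, id, label;

    allChunks = UNION chunk1, chunk2;

    -- one STORE for all chunks: MultiStorage splits the output into one
    -- subdirectory per distinct value of field 0 (chunk_id)
    STORE allChunks INTO '$output_root' USING
        org.apache.pig.piggybank.storage.MultiStorage('$output_root', '0', 'none', ',');

Whether this actually avoids the slot contention depends on how the single job is planned, so treat it as something to try rather than a guaranteed fix.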
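
For option 3, here is a minimal sketch of the EXEC pattern, assuming a bare EXEC between the STORE statements acts as a batch boundary that runs everything before it as its own job before the next chunk is planned. The names and paths are placeholders.

    --------------------------------------------------------------------------
    -- sketch: finish chunk 1 completely before chunk 2 is planned
    --------------------------------------------------------------------------
    labelZeroTrainingDataChunk1 = SAMPLE labelZeroTrainingData $RATIO;
    trainingChunk1Raw = UNION labelZeroTrainingDataChunk1, labelOneTrainingData;
    STORE trainingChunk1Raw INTO '$training_data_1_s3_path' USING PigStorage(',');

    -- batch boundary: execute everything above before moving on
    EXEC;

    labelZeroTrainingDataChunk2 = SAMPLE labelZeroTrainingData $RATIO;
    trainingChunk2Raw = UNION labelZeroTrainingDataChunk2, labelOneTrainingData;
    STORE trainingChunk2Raw INTO '$training_data_2_s3_path' USING PigStorage(',');

    -- repeat for the remaining chunks, with an EXEC after each STORE
    EXEC;

Like -M, this gives up some parallelism, but only at the chunk boundaries, so the chunks no longer compete with each other for mapper/reducer slots.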