
I can run a Python mrjob job locally, and it runs much faster that way. But when I look at the output, a lot of data is missing. I'm wondering whether this is because one function in my code takes longer to run, so that none of the results computed after it appear in the output.

Here is my mock-up code (I cannot show the real code), followed by my description:

def core_logic(all_records):
    my_dct = {}
    for name, single_record in all_records.items():
        # Create the per-name lists up front so the appends below cannot raise KeyError.
        my_dct[name] = {'color': [], 'flavor': [], 'quality': []}
        for i in range(410):
            if i < 10:
                my_dct[name]['color'].append('')
                my_dct[name]['flavor'].append('')
                my_dct[name]['quality'].append('')
            else:
                frozen_ratio = 0.7
                if frozen_ratio < 0.5:
                    my_dct[name]['color'].append('')
                    my_dct[name]['flavor'].append('')
                    my_dct[name]['quality'].append('Not Frozen')
                else:
                    food_score = a_complex_function(single_record[name])  # Problem started here

I have a Python file that contains the MapReduce job, and its reducer calls this core_logic function. After checking the output, I found that every food_score is missing. To get this score, I have to call a_complex_function, which lives in another file. It is a genuinely complex function and takes time to run.

The input to core_logic, all_records, is a dictionary.

I strongly suspect the cause is this: when code runs in parallel under MapReduce, are results simply not recorded if some method takes a long time to run? However, I tried adding time.sleep(2000) after a_complex_function(), and the data is still missing. The function normally takes less than 2 seconds to finish. I also tried printing the data at each step: all the previous steps are fine, but right after a_complex_function() the data is lost.
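One way to narrow this down is to check whether a_complex_function() is slow or is actually raising an exception. The following is only a minimal sketch under assumptions: timed_call is a hypothetical helper, and the a_complex_function defined here is a stand-in for the real one, which cannot be shown. Diagnostics go to stderr because the command below redirects stdout into the output file.

```python
import sys
import time
import traceback


def a_complex_function(record):
    # Hypothetical stand-in for the real function, which lives in another file.
    return sum(record) / len(record)


def timed_call(func, *args):
    """Call func, report its timing or any exception on stderr, and return the result."""
    start = time.perf_counter()
    try:
        result = func(*args)
    except Exception:
        # Send the traceback to stderr so it is not mixed into the job output on stdout.
        traceback.print_exc(file=sys.stderr)
        return None
    elapsed = time.perf_counter() - start
    print(f"{func.__name__} took {elapsed:.3f}s -> {result!r}", file=sys.stderr)
    return result


food_score = timed_call(a_complex_function, [3, 4, 5])
failed_score = timed_call(a_complex_function, [])  # empty input raises ZeroDivisionError
```

If the second kind of call shows up in the real job, the missing scores would point to an error inside the function rather than to its running time.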

I'm running the MapReduce job on my own laptop, from the command line: python map_reduce_job.py < test_data/test_file.csv > outfile.new

Has anyone run into this type of problem? Is there a solution?

Cherry Wu

0 Answers