
I have a Python program running on some input data on a 32-bit Ubuntu 12.04 machine with 4 GB of RAM. Both the time and space complexity of the program are O(n). When the input data is around 100 kB, it completes execution in about 4 seconds with peak RAM consumption of 0.5% (as reported by the `top` command in Linux). However, when I tried input data of sizes 500 kB, 2.5 MB and 16 MB, the process did not finish within 1 hour (in each case I had to cancel it with Ctrl+C), and the memory consumption was stuck at 1.6% (i.e. around 64 MB) in each case. Can I somehow allocate more RAM to this Python process?

Note: I am implementing the MapReduce job in Python using the 'mrjob' library.
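For context, the job follows the usual two-step mrjob structure that the logs below show (a first mapper and reducer, then a second mapper). This is only a simplified sketch with made-up class and method names, not my actual mt1.py, and it uses the mrjob 0.3.x `self.mr()` step syntax:

    from mrjob.job import MRJob


    class MRSketchJob(MRJob):
        """Illustrative two-step job; all names here are made up."""

        def steps(self):
            # mrjob 0.3.x step syntax; newer versions use MRStep instead.
            return [self.mr(mapper=self.mapper_parse, reducer=self.reducer_group),
                    self.mr(mapper=self.mapper_tag)]

        def mapper_parse(self, _, line):
            # Split each CSV line and key it by its first field.
            fields = line.split(',')
            yield fields[0], fields[1:]

        def reducer_group(self, key, values):
            # Collect all records that share a key.
            yield key, list(values)

        def mapper_tag(self, key, values):
            # Second step: emit a per-key summary.
            yield key, len(values)


    if __name__ == '__main__':
        MRSketchJob.run()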

Following is the log of a successful run when the input CSV file is 100 kB.

ankit@ubuntu:~/mrj/mrjo/mrjob/examples$ python mt1.py as.txt > asop.txt
using configs in /home/ankit/.mrjob.conf
creating tmp directory /home/ankit/mrj/mrjo/examples/mt1.ankit.20121224.094809.251269
> /usr/bin/python mt1.py --step-num=0 --mapper /home/ankit/mrj/mrjo/examples/mt1.ankit.20121224.094809.251269/input_part-00000
writing to /home/ankit/mrj/mrjo/examples/mt1.ankit.20121224.094809.251269/step-0-mapper_part-00000
> /usr/bin/python mt1.py --step-num=0 --mapper /home/ankit/mrj/mrjo/examples/mt1.ankit.20121224.094809.251269/input_part-00001
writing to /home/ankit/mrj/mrjo/examples/mt1.ankit.20121224.094809.251269/step-0-mapper_part-00001
Counters from step 1:
  (no counters found)
writing to /home/ankit/mrj/mrjo/examples/mt1.ankit.20121224.094809.251269/step-0-mapper-sorted
> sort /home/ankit/mrj/mrjo/examples/mt1.ankit.20121224.094809.251269/step-0-mapper_part-00000 /home/ankit/mrj/mrjo/examples/mt1.ankit.20121224.094809.251269/step-0-mapper_part-00001
> /usr/bin/python mt1.py --step-num=0 --reducer /home/ankit/mrj/mrjo/examples/mt1.ankit.20121224.094809.251269/input_part-00000
writing to /home/ankit/mrj/mrjo/examples/mt1.ankit.20121224.094809.251269/step-0-reducer_part-00000
Counters from step 1:
  (no counters found)
> /usr/bin/python mt1.py --step-num=1 --mapper /home/ankit/mrj/mrjo/examples/mt1.ankit.20121224.094809.251269/input_part-00000
writing to /home/ankit/mrj/mrjo/examples/mt1.ankit.20121224.094809.251269/step-1-mapper_part-00000
Counters from step 2:
  (no counters found)
Moving /home/ankit/mrj/mrjo/examples/mt1.ankit.20121224.094809.251269/step-1-mapper_part-00000 -> /home/ankit/mrj/mrjo/examples/mt1.ankit.20121224.094809.251269/output/part-00000
Streaming final output from /home/ankit/mrj/mrjo/examples/mt1.ankit.20121224.094809.251269/output
removing tmp directory /home/ankit/mrj/mrjo/examples/mt1.ankit.20121224.094809.251269

This is the execution log and the traceback (after Ctrl+C) when the input CSV file is 2.5 MB.

ankit@ubuntu:~/mrj/mrjo/mrjob/examples$ python mt1.py matlabsample.csv > matsamop.txt
using configs in /home/ankit/.mrjob.conf
creating tmp directory /home/ankit/mrj/mrjo/examples/mt1.ankit.20121224.065246.700221
> /usr/bin/python mt1.py --step-num=0 --mapper /home/ankit/mrj/mrjo/examples/mt1.ankit.20121224.065246.700221/input_part-00000
writing to /home/ankit/mrj/mrjo/examples/mt1.ankit.20121224.065246.700221/step-0-mapper_part-00000
> /usr/bin/python mt1.py --step-num=0 --mapper /home/ankit/mrj/mrjo/examples/mt1.ankit.20121224.065246.700221/input_part-00001
writing to /home/ankit/mrj/mrjo/examples/mt1.ankit.20121224.065246.700221/step-0-mapper_part-00001
Counters from step 1:
  (no counters found)
writing to /home/ankit/mrj/mrjo/examples/mt1.ankit.20121224.065246.700221/step-0-mapper-sorted
> sort /home/ankit/mrj/mrjo/examples/mt1.ankit.20121224.065246.700221/step-0-mapper_part-00000 /home/ankit/mrj/mrjo/examples/mt1.ankit.20121224.065246.700221/step-0-mapper_part-00001
> /usr/bin/python mt1.py --step-num=0 --reducer /home/ankit/mrj/mrjo/examples/mt1.ankit.20121224.065246.700221/input_part-00000
writing to /home/ankit/mrj/mrjo/examples/mt1.ankit.20121224.065246.700221/step-0-reducer_part-00000
Counters from step 1:
  (no counters found)
> /usr/bin/python mt1.py --step-num=1 --mapper /home/ankit/mrj/mrjo/examples/mt1.ankit.20121224.065246.700221/input_part-00000
writing to /home/ankit/mrj/mrjo/examples/mt1.ankit.20121224.065246.700221/step-1-mapper_part-00000
^CTraceback (most recent call last):
  File "mt1.py", line 311, in <module>
    Motion_Tagging.run()
  File "/usr/local/lib/python2.7/dist-packages/mrjob-0.3.5-py2.7.egg/mrjob/job.py", line 545, in run
    mr_job.execute()
  File "/usr/local/lib/python2.7/dist-packages/mrjob-0.3.5-py2.7.egg/mrjob/job.py", line 561, in execute
    self.run_job()
  File "/usr/local/lib/python2.7/dist-packages/mrjob-0.3.5-py2.7.egg/mrjob/job.py", line 631, in run_job
    runner.run()
  File "/usr/local/lib/python2.7/dist-packages/mrjob-0.3.5-py2.7.egg/mrjob/runner.py", line 490, in run
    self._run()
  File "/usr/local/lib/python2.7/dist-packages/mrjob-0.3.5-py2.7.egg/mrjob/local.py", line 193, in _run
    combiner_args=combiner_args)
  File "/usr/local/lib/python2.7/dist-packages/mrjob-0.3.5-py2.7.egg/mrjob/local.py", line 488, in _invoke_step
    self._wait_for_process(proc_dict, step_num)
  File "/usr/local/lib/python2.7/dist-packages/mrjob-0.3.5-py2.7.egg/mrjob/local.py", line 657, in _wait_for_process
    tb_lines = find_python_traceback(stderr_lines)
  File "/usr/local/lib/python2.7/dist-packages/mrjob-0.3.5-py2.7.egg/mrjob/parse.py", line 171, in find_python_traceback
    for line in lines:
  File "/usr/local/lib/python2.7/dist-packages/mrjob-0.3.5-py2.7.egg/mrjob/local.py", line 680, in _process_stderr_from_script
    for line in stderr:
KeyboardInterrupt
  • I doubt your code has a memory problem; 64MB is *not* much memory usage at all. – Martijn Pieters Dec 24 '12 at 09:55
  • I don't think you have an issue with RAM allocation - normally, the interpreter just takes what memory it needs. I think your program really does not run in `O(n)`. If you post your code, we can take a look. – inspectorG4dget Dec 24 '12 at 09:56
  • @inspectorG4dget: Thanks for your response. Unfortunately, I can't post the code for certain reasons, but I assure you that both its time and space complexity are O(n). And yes, it's a MapReduce job. – Dec 24 '12 at 10:16
  • @MartijnPieters: That's exactly what I thought when I observed it in the top processes. – Dec 24 '12 at 10:21
  • @AnkitAgrawal: evidently, it is not working *somewhere*. Add logging output to verify that it is still doing work. The traceback you get when you hit `CTRL-C` should also provide clues as to where it was doing work. – Martijn Pieters Dec 24 '12 at 10:26
  • If you cannot share code, then this question is too localized for us to be able to help you, nor will it ever be helpful to others. I've voted to close it on those grounds. – Martijn Pieters Dec 24 '12 at 10:29
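Following the suggestion above to add logging, here is a minimal sketch (not the actual mt1.py; the class and counter names are made up) of how an mrjob mapper can emit counters so you can see whether it is still making progress. Counters appear under "Counters from step ..." in the local-runner log, so a growing count confirms the mapper is still doing work:

    from mrjob.job import MRJob


    class MRProgressCheck(MRJob):
        """Hypothetical job that reports progress while it maps."""

        def mapper(self, _, line):
            # Count every input line processed; the total shows up in the
            # "Counters from step ..." section of the run log.
            self.increment_counter('progress', 'lines_seen', 1)
            fields = line.split(',')
            yield fields[0], 1

        def reducer(self, key, values):
            yield key, sum(values)


    if __name__ == '__main__':
        MRProgressCheck.run()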

2 Answers


You don't "allocate memory to a Python process"; you use bigger data structures in the Python program. At a fundamental level, your algorithm is probably flawed in such a way that it doesn't take advantage of the memory that is available.

Ignacio Vazquez-Abrams
  • The algorithm doesn't need to be flawed just because it doesn't use more of the available memory. It's more likely the algorithm enters an infinite loop or the like; it may not need any more memory. – phant0m Dec 24 '12 at 09:58
  • @Ignacio: The Python script that I have written takes a CSV file as input and stores each transaction/tuple as a list inside a main list. The rest of the script performs computations on the elements of those lists, i.e. I am only dealing with the list data structure. Can you please shed more light on what you mean by using bigger structures? – Dec 24 '12 at 10:10
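To make that concrete, here is a rough illustration (an assumption about the pattern described above, not the actual script) of the difference between buffering every CSV row in one main list and streaming the rows one at a time:

    import csv

    def load_all_rows(path):
        # Pattern described above: every transaction/tuple stored as a list
        # inside one main list, so memory use grows with the file size.
        with open(path) as f:
            return [row for row in csv.reader(f)]

    def iter_rows(path):
        # Streaming alternative: yield one parsed row at a time, so only the
        # current row needs to be held in memory.
        with open(path) as f:
            for row in csv.reader(f):
                yield row

    # Usage sketch ('process' is a hypothetical per-row function):
    # rows = load_all_rows('matlabsample.csv')   # whole file in memory
    # for row in iter_rows('matlabsample.csv'):  # roughly constant memory
    #     process(row)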

FYI, this is not a code-level solution, but you can go through the link below to get a deeper understanding of how Python's memory allocator works and how a related problem was fixed. It also discusses additional areas where Python's memory management could be improved. Hope it will be useful.

http://www.evanjones.ca/memoryallocator/
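If you want to see how the allocator behaves on your own machine, a rough experiment (Linux-only, purely illustrative, and not tied to any particular fix) is to watch the process's resident set size before and after freeing a large structure:

    # Observe whether CPython returns memory to the OS after a large object
    # is freed (Linux only; reads VmRSS from /proc/self/status).
    import gc

    def rss_kb():
        """Return the current resident set size of this process in kB."""
        with open('/proc/self/status') as f:
            for line in f:
                if line.startswith('VmRSS:'):
                    return int(line.split()[1])
        return -1

    print('before allocation: %d kB' % rss_kb())
    data = [[i, i * 2] for i in range(10 ** 6)]  # roughly 100+ MB of small lists
    print('after allocation:  %d kB' % rss_kb())
    del data
    gc.collect()
    print('after freeing:     %d kB' % rss_kb())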

Darknight