
I am trying to convert my Perl one-liners to pyp. My first attempt was kindly given to me as the answer to another question:

pyp "mm | p if n==0 else (p[:-2] + [(int(x)%12) for x in p[-2:]]) | mm"

However, this turns out to be amazingly slow. If I create a test file using

import random

# Write 50000 lines of 8 comma-separated random integers in [0, 1000)
for j in xrange(50000):
    print ",".join(str(random.choice(xrange(1000))) for i in xrange(8))

and then run

time (cat testmedium.txt |~/.local/bin/pyp "mm | p if n==0 else (p[:-2] + [(int(x)%12) for x in p[-2:]]) | mm" > /dev/null)

I get

real    1m27.889s
user    1m26.941s
sys     0m0.688s

However, the equivalent in Perl is almost instant.

time (cat testmedium.txt |perl -l -a -F',' -p -e'if ($. > 1) { $F[6] %=12; $F[7] %= 12;$_ = join(q{,}, @F[6,7]) }' > /dev/null)

real    0m0.196s
user    0m0.192s
sys     0m0.012s

For larger test files the difference is even more dramatic.

marshall
  • Perl and Python's interpreters do *not* work in the same way. Something that is fast with one can be the worst approach in the other. If you told us what you are trying to achieve we could probably provide a fast pythonic version. – Bakuriu May 05 '13 at 18:32
  • @Bakuriu I think the pyp code must be doing something odd as it also uses a huge amount of memory (728MB?) where I would expect it to process the lines pretty much on the fly. The goal is basically to take the input of comma separated numerical values and to output it in the same format except with two of the numbers in each line given modulo 12. The linked question has some more small details. – marshall May 05 '13 at 18:40
  • Did you try to profile something like `pyp "mm | mm"`, to check whether it's pyp itself taking the time using the "pipes"? – Bakuriu May 05 '13 at 18:58
  • I'd also be curious if changing the `(int(x)%12)` to `str(int(x)%12)` had any effect in `pyp`. – Amber May 05 '13 at 18:59
  • @Bakuriu You are right. `time (~/.local/bin/pyp "mm | mm" < testmedium.txt)` is very slow too! – marshall May 05 '13 at 19:05
  • OK, so it's pyp itself that is slow. It's not exactly unexpected, even from reading the docs you realize it's doing a lot of magic. :-) – Lennart Regebro May 05 '13 at 19:07
  • Does using an explicit `p.split(',')` and `','.join(p)` instead of `mm` produce any change in the timings? – Bakuriu May 05 '13 at 19:07
  • @Bakuriu `~/.local/bin/pyp "p.split(',') | ','.join(p)" < testmedium.txt > /dev/null` suffers from the same problem. – marshall May 05 '13 at 20:31
  • However `python -c "import sys;print '\n'.join(','.join(x.split(',')) for x in sys.stdin)" < test.txt > result.txt` does not. So yeah, definitely something in `pyp`. – Amber May 05 '13 at 22:53
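
The comparisons made in these comments (a no-op pipeline versus a plain Python split/join pass) can be repeated with a small timing harness. The following is only a sketch, assuming Python 3; `time_command` is a hypothetical helper name, and the baseline command is a plain-Python identity pass written for illustration, not anything pyp provides.

```python
import subprocess
import sys
import time


def time_command(argv, input_bytes):
    """Run argv with input_bytes on stdin; return (elapsed_seconds, stdout_bytes)."""
    start = time.perf_counter()
    result = subprocess.run(argv, input=input_bytes,
                            stdout=subprocess.PIPE, check=True)
    return time.perf_counter() - start, result.stdout


# Baseline: split each line on commas and join it again, leaving it unchanged.
baseline = [
    sys.executable, "-c",
    "import sys\n"
    "for line in sys.stdin:\n"
    "    sys.stdout.write(','.join(line.split(',')))\n",
]

sample = b"1,2,3\n4,5,6\n"
elapsed, output = time_command(baseline, sample)
```

To compare against pyp, the same helper could be called with an argument list such as `["pyp", "mm | mm"]` and the contents of the test file, assuming pyp is installed and on the PATH. If the no-op pyp pipeline is already slow relative to the baseline, the overhead lies in the tool rather than in the modulo expression.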

2 Answers


This code...

import sys

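for index, line in enumerate(sys.stdin):
    line = line.rstrip('\n')  # drop the newline so print does not double it
    if index == 0:
        print line  # pass the first (header) line through unchanged
    else:
        values = line.split(',')
        # replace the last two fields with their values modulo 12
        values[-2:] = [str(int(x) % 12) for x in values[-2:]]
        print ','.join(values)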

runs in under a second for me (using a test file generated with the same method you did):

$ time (cat test.txt | python foo.py > /dev/null)

real    0m0.363s
user    0m0.339s
sys     0m0.032s

So if you're running into issues, it's probably an inefficiency with something pyp is trying to do.
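
The script above uses Python 2 `print` statements. For readers on Python 3, the same line-by-line approach might look like the sketch below. The function name `reduce_last_two` and the `modulus` parameter are hypothetical, introduced only for illustration; the logic mirrors the loop above and processes one line at a time, so memory use does not grow with file size.

```python
def reduce_last_two(lines, modulus=12):
    """Yield each line with its last two comma-separated fields taken modulo `modulus`.

    The first line is passed through unchanged, mirroring the `n==0`
    case in the pyp one-liner from the question.
    """
    for index, line in enumerate(lines):
        line = line.rstrip("\n")  # strip the newline; the caller decides how to print
        if index == 0:
            yield line
            continue
        values = line.split(",")
        # Replace only the last two fields; earlier fields stay as strings.
        values[-2:] = [str(int(x) % modulus) for x in values[-2:]]
        yield ",".join(values)
```

Used as a filter, this would amount to looping over `reduce_last_two(sys.stdin)` and printing each result. Because it is a generator, nothing is accumulated in memory between lines.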

Amber
  • Also creating a new `values` `list`, like in the `pyp` version, doesn't change the timings by much, so that's not the problem. – Bakuriu May 05 '13 at 18:57
  • I think this is just a design flaw in pyp sadly. It seems unsuited to processing large files presently. – marshall May 15 '13 at 10:12

This is an indirect answer to your question @marshall.

First, for me the biggest advantage of pyp is not having to learn another language, and I don't generally deal with large amounts of data, so it's a good fit for my needs. I also understand that there have been some speed-oriented optimizations to pyp, which may have affected the problem you describe.

I wondered if pypy might provide a faster version of pyp so I created an alias for pyp:

alias 'pl=pypy /usr/bin/pyp'

Then I ran this command with both pyp and pl

lr | pl "'doc',p, p.replace('e','EEE')+'.xpg' | pp.reverse() | ''.join(p)" | pl "d|u"

where lr is an alias for ls -R + ls -A just to create a long recursive list to time the operation.

The results were 8.04 seconds for pyp using Python 2.7.6 and 4.46 seconds for the pl alias. For a much larger set of directories it was 470 and 250 seconds. Python runs at 100% of one core during this operation as does PyPy.

So if you have pypy on your system there would seem to be a substantial performance gain possible with a simple alias.

John 9631