
I just tried to run this script with Python 3.3. Unfortunately it's about twice as slow as with Python 2.7.

#!/usr/bin/env python

from sys import stdin

def main():
    for line in stdin:
        try:
            fields = line.split('"', 6)
            print(fields[5])
        except:
            pass

if __name__ == '__main__':
    main()

Here are the results:

$ time zcat access.log.gz | python3 -m cProfile ./ua.py > /dev/null

real    0m13.276s
user    0m18.977s
sys     0m0.484s

$ time zcat access.log.gz | python2 -m cProfile ./ua.py > /dev/null

real    0m6.139s
user    0m11.693s
sys     0m0.408s

Profiling shows that the additional time is spent in print:

$ zcat access.log.gz | python3 -m cProfile ./ua.py | tail -15
   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap>:1594(_handle_fromlist)
   196806    0.234    0.000    0.545    0.000 codecs.py:298(decode)
        1    0.000    0.000   13.598   13.598 ua.py:3(<module>)
        1    4.838    4.838   13.598   13.598 ua.py:6(main)
        1    0.000    0.000   13.598   13.598 {built-in method exec}
        1    0.000    0.000    0.000    0.000 {built-in method hasattr}
  4300456    4.726    0.000    4.726    0.000 {built-in method print}
   196806    0.312    0.000    0.312    0.000 {built-in method utf_8_decode}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
  4300456    3.489    0.000    3.489    0.000 {method 'split' of 'str' objects}

$ zcat access.log.gz | python2 -m cProfile ./ua.py | tail -10
   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    6.573    6.573 ua.py:3(<module>)
        1    3.894    3.894    6.573    6.573 ua.py:6(main)
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
  4300456    2.680    0.000    2.680    0.000 {method 'split' of 'str' objects}

How can I avoid this overhead? Has it something to do with UTF-8?

Timo
    Well, `print` isn't a statement anymore in python3 so some overhead is expected. This could probably be lessened by using `sys.stdout.write` instead - or even better, by first creating a list of your strings and then using `sys.stdout.writelines`, assuming memory is not an issue. In both cases, you would need to add the newlines created by `print` yourself. – l4mpi Dec 14 '13 at 13:56
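The comment's suggestion can be sketched as follows (a hypothetical collect-then-`writelines` variant; it assumes the whole output fits in memory):

```python
import sys

def main():
    # Collect output lines first, then emit them in a single call;
    # this avoids one print() call (and its overhead) per line.
    out = []
    for line in sys.stdin:
        fields = line.split('"', 6)
        if len(fields) > 5:
            # print() would have appended the newline for us
            out.append(fields[5] + '\n')
    sys.stdout.writelines(out)

if __name__ == '__main__':
    main()
```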

1 Answer


Python 3 decodes data read from stdin and encodes again to stdout; it is not so much the print() function that is slower here as the unicode-to-bytes conversion and vice-versa.
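You can see that conversion in isolation; as a minimal sketch, an `io.TextIOWrapper` over an in-memory `BytesIO` mirrors how `sys.stdout` wraps `sys.stdout.buffer` in Python 3:

```python
import io

# The text layer encodes every str written to it before the bytes
# reach the underlying binary buffer -- the same work sys.stdout
# does on each print() call.
raw = io.BytesIO()
text = io.TextIOWrapper(raw, encoding='utf-8')
text.write('héllo\n')
text.flush()
print(raw.getvalue())  # b'h\xc3\xa9llo\n' -- the encoded bytes
```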

In your case you probably want to bypass this and deal with bytes only; you can access the underlying BufferedIOBase implementation through the .buffer attribute:

from sys import stdin, stdout

try:
    bytes_stdin, bytes_stdout = stdin.buffer, stdout.buffer
except AttributeError:
    bytes_stdin, bytes_stdout = stdin, stdout

def main():
    for line in bytes_stdin:
        try:
            fields = line.split(b'"', 6)
            bytes_stdout.write(fields[5] + b'\n')
        except IndexError:
            pass

if __name__ == '__main__':
    main()

You'll now have to use bytes_stdout.write() instead, as print() insists on writing to the stdout TextIOBase implementation.

Note that the .split() now uses a bytes literal b'"' and we write a bytes-literal b'\n' as well (which normally would be taken care of by print()).
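For illustration, on a made-up combined-log-format line (hypothetical sample data), the bytes version extracts the same field:

```python
# A fabricated access-log line; field 5 after splitting on b'"'
# is the user-agent, already as bytes.
line = b'1.2.3.4 - - [14/Dec/2013] "GET / HTTP/1.1" 200 99 "-" "Mozilla/5.0" x\n'
fields = line.split(b'"', 6)
print(fields[5])  # b'Mozilla/5.0'
```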

The above is compatible with Python 2.6 and up. Python 2.5 doesn't support the b prefix.

Martijn Pieters