Pexpect throws unicode decode error when make command is run to compile C libraries

Question

I am running make to compile C libraries in a Python project and using pexpect (Python 3.3) for the automation. pexpect reads the output of make in chunks, and on one such chunk it raises the following error while converting the Python 3 bytes to str. The problem is intermittent and does not occur often.

UnicodeDecodeError: 'utf-8' codec can't decode bytes in position 1998-1999: unexpected end of data

The sample code below reproduces it: when the data contains a multibyte character (any non-ASCII Unicode character), pexpect fails to decode a chunk that holds only part of that character's bytes.

#!/usr/bin/python
# -*- coding: utf-8 -*-
from base import pexpect  # local copy of pexpect under base/

MAX_READ_CHUNK = 8  # small read size so a multibyte character straddles two reads

def run(cmd):
    child = pexpect.spawn(cmd, maxread=MAX_READ_CHUNK)
    while True:
        i = child.expect([pexpect.EOF, pexpect.TIMEOUT])

        if child.before:
            print(child.before)

        if i == 0:  # EOF
            break
        elif i == 1:  # TIMEOUT
            continue

    child.close()
    return child.exitstatus

############## Main ################
data = '“HELLO WORLD”'
# i.e. data.encode('utf-8') == b'\xe2\x80\x9cHELLO WORLD\xe2\x80\x9d'
print("Data in readable form = %s " % data)
print("Data in bytes         = %s \n\n" % data.encode('utf-8'))

run("echo %s" % data)

It produces the following traceback:

Data in readable form = “HELLO WORLD” 
Data in bytes         = b'\xe2\x80\x9cHELLO WORLD\xe2\x80\x9d' 


_cast_unicode() enc=[utf-8] s=[b'\xe2\x80\x9cHELLO'] 
_cast_unicode() enc=[utf-8] s=[b' WORLD\xe2\x80'] 
Traceback (most recent call last):
  File "test.py", line 33, in <module>
    run("echo %s"%data)
  File "test.py", line 11, in run
    i = child.expect([pexpect.EOF,pexpect.TIMEOUT])
  File "/home/test/Downloads/base/pexpect.py", line 1358, in expect
    return self.expect_list(compiled_pattern_list, timeout, searchwindowsize)
  File "/home/test/Downloads/base/pexpect.py", line 1372, in expect_list
    return self.expect_loop(searcher_re(pattern_list), timeout, searchwindowsize)
  File "/home/test/Downloads/base/pexpect.py", line 1425, in expect_loop
    c = self.read_nonblocking (self.maxread, timeout)
  File "/home/test/Downloads/base/pexpect.py", line 1631, in read_nonblocking
    return super(spawn, self).read_nonblocking(size=size, timeout=timeout)\
  File "/home/test/Downloads/base/pexpect.py", line 868, in read_nonblocking
    s2 = self._cast_buffer_type(s)
  File "/home/test/Downloads/base/pexpect.py", line 1614, in _cast_buffer_type
    return _cast_unicode(s, self.encoding)
  File "/home/test/Downloads/base/pexpect.py", line 156, in _cast_unicode
    return s.decode(enc)
UnicodeDecodeError: 'utf-8' codec can't decode bytes in position 6-7: unexpected end of data

When MAX_READ_CHUNK is changed to 9 in the code above, it works fine.

# Output When "MAX_READ_CHUNK = 9"
Data in readable form = “HELLO WORLD” 
Data in bytes         = b'\xe2\x80\x9cHELLO WORLD\xe2\x80\x9d' 


_cast_unicode() enc=[utf-8] s=[b'\xe2\x80\x9cHELLO '] 
_cast_unicode() enc=[utf-8] s=[b'WORLD\xe2\x80\x9d\r'] 
_cast_unicode() enc=[utf-8] s=[b'\n'] 
“HELLO WORLD”

How can I handle this "UnicodeDecodeError: 'utf-8' codec can't decode bytes in position ...: unexpected end of data" in pexpect while running make?

This was a bug - it should be fixed in pexpect 3.0, coming soon. — Thomas K, Oct 28 '13 at 01:05

score 0 · Answer 1 · answered Sep 13 '13 at 13:05

What's happening is that pexpect fails when the bytes of a single multibyte UTF-8 character are split across two reads. In your example the closing quote \xe2\x80\x9d can't be decoded because, with a chunk size of 8, the second chunk ends at \xe2\x80 and the \x9d byte only arrives in the next read.
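
The split can be reproduced without pexpect. The sketch below assumes the child's output is the 19 bytes that echo would send through a pty (the quoted text followed by \r\n) and decodes fixed-size slices independently, which is what the traceback suggests happens on each read. splits_character is a hypothetical helper name, not part of pexpect.

```python
# Bytes assumed to arrive from `echo “HELLO WORLD”` through a pty.
PAYLOAD = "“HELLO WORLD”\r\n".encode("utf-8")


def splits_character(payload, size):
    """Return True if decoding fixed-size slices independently would fail."""
    for start in range(0, len(payload), size):
        try:
            payload[start:start + size].decode("utf-8")
        except UnicodeDecodeError:
            return True
    return False


if __name__ == "__main__":
    # Size 8: the second slice ends with \xe2\x80, so decoding fails.
    print(splits_character(PAYLOAD, 8))  # True
    # Size 9: every slice ends on a character boundary.
    print(splits_character(PAYLOAD, 9))  # False
```

Whether a given size fails depends only on where the multibyte characters happen to fall in the stream, which would explain why the error shows up intermittently with real make output.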

Unfortunately I'm not familiar enough with pexpect to know how to solve this, but I can imagine two ways:

1. Try setting maxread to 1 (unbuffered), or
2. (This is dirty) Catch the exception, buffer the output, and process it along with the next output window.

If you are processing buffers of known size, set maxread to the buffer size.
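
The buffering idea can be sketched with the standard library's incremental decoder, which holds back an incomplete trailing sequence and completes it when the next chunk arrives, so no exception has to be caught. This is only a sketch under the assumption that you can get the raw bytes of each read yourself; decode_chunks is a hypothetical helper, not pexpect's API.

```python
import codecs


def decode_chunks(chunks, encoding="utf-8"):
    """Decode an iterable of byte chunks, carrying partial characters over."""
    decoder = codecs.getincrementaldecoder(encoding)(errors="strict")
    for chunk in chunks:
        text = decoder.decode(chunk)  # incomplete trailing bytes stay buffered
        if text:
            yield text
    # Flush at end of stream; raises UnicodeDecodeError if bytes are left over.
    tail = decoder.decode(b"", final=True)
    if tail:
        yield tail


if __name__ == "__main__":
    payload = "“HELLO WORLD”\r\n".encode("utf-8")
    # Same 8-byte reads that broke per-chunk decoding in the question.
    chunks = [payload[i:i + 8] for i in range(0, len(payload), 8)]
    for piece in decode_chunks(chunks):
        print(repr(piece))
```

The same helper also copes with one-byte reads, since the decoder simply returns an empty string until a character is complete.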

answered Sep 13 '13 at 13:05

Michael Foukarakis


Regarding point 1: the problem with unbuffered output is that decoding each read as UTF-8 is again going to hit an invalid sequence. – user634615 Sep 14 '13 at 14:47
I imagined as such, but I couldn't verify so I posted it anyways. – Michael Foukarakis Sep 14 '13 at 15:57
