'UTF-8' decoding error while using unireedsolomon package

Question

I have been writing code using the unireedsolomon package. The package adds parity bytes, most of which are extended ASCII characters. I apply bit-level errors after converting the message, including the 'special character' parities, to a bit stream with the following code:

def str_to_byte(padded):
    byte_array = padded.encode()
    binary_int = int.from_bytes(byte_array, "big")
    binary_string = bin(binary_int)
    without_b = binary_string[2:]
    return without_b

def byte_to_str(without_b):
    binary_int = int(without_b, 2)
    byte_number = (binary_int.bit_length() + 7) // 8  # round up to whole bytes
    binary_array = binary_int.to_bytes(byte_number, "big")
    ascii_text = binary_array.decode()
    padded_char = ascii_text[:]
    return padded_char

After converting the string to a bit stream, I apply errors at random. In some cases I cannot recover the special characters (or ordinary characters), and I hit the 'utf-8' error before I can even decode the message.

If I flip a bit, the result should still fall within the 255 ASCII character values, yet I get errors. Is there any way to rectify this?

Mark Tolonen · Accepted Answer · 2022-06-22T23:28:52.663

It's a bit odd that the error-correction package works with Unicode strings. Better to encode byte data, since the payload may not be text at all. There is also no need to work with actual binary strings (Unicode 1s and 0s): flip bits directly in the byte strings.

Below I've wrapped the encode/decode routines so they take either Unicode text and return byte strings or vice versa. There is also a corrupt function that will flip bits in the encoded result to see the error correction in action:

import unireedsolomon as rs
import random

def corrupt(encoded):
    '''Flip up to 3 bits (might pick the same bit more than once).
    '''
    b = bytearray(encoded) # convert to writable bytes
    for _ in range(3):
        index = random.randrange(len(b)) # pick random byte
        bit = random.randrange(8)        # pick random bit
        b[index] ^= 1 << bit             # flip it
    return bytes(b) # back to read-only bytes, but not necessary

def encode(coder,msg):
    '''Convert the msg to UTF-8-encoded bytes and encode with "coder".  Return as bytes.
    '''
    return coder.encode(msg.encode('utf8')).encode('latin1')

def decode(coder,encoded):
    '''Decode the encoded message with "coder", convert result to bytes and decode UTF-8.
    '''
    return coder.decode(encoded)[0].encode('latin1').decode('utf8')

coder = rs.RSCoder(20,13)
msg = 'hello(你好)'  # 9 Unicode characters, but 13 (maximum) bytes when encoded to UTF-8.
encoded = encode(coder,msg)
print(encoded)
corrupted = corrupt(encoded)
print(corrupted)
decoded = decode(coder,corrupted)
print(decoded)

Output: Note that the first l in hello (ASCII 0x6C) was corrupted to 0xEC, the second l changed to an h (ASCII 0x68), and another byte changed from 0xE5 to 0xF5. You can actually randomly change any 3 bytes (not just bits), including error-correcting bytes, and the message will still decode.

b'hello(\xe4\xbd\xa0\xe5\xa5\xbd)8\xe6\xd3+\xd4\x19\xb8'
b'he\xecho(\xe4\xbd\xa0\xf5\xa5\xbd)8\xe6\xd3+\xd4\x19\xb8'
hello(你好)
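
The three-byte limit follows from the usual Reed-Solomon bound: when the error positions are unknown, each repaired byte costs two parity bytes. The helper below is a small sketch of that arithmetic; the name `max_correctable_errors` is made up for illustration and is not part of unireedsolomon.

```python
def max_correctable_errors(n: int, k: int) -> int:
    """Byte errors an (n, k) Reed-Solomon code can repair at unknown positions."""
    parity = n - k      # parity bytes appended to the k message bytes
    return parity // 2  # two parity bytes are needed per repaired byte

print(max_correctable_errors(20, 13))  # 3
print(max_correctable_errors(40, 13))  # 13
```

Flipping bits in more bytes than this goes beyond what the coder is designed to repair.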

A note about .encode('latin1'): The encoder works with Unicode strings whose code points lie in the range U+0000 to U+00FF. Because Latin-1 is the first 256 Unicode code points, the 'latin1' codec converts a Unicode string made up of those code points 1:1 to their byte values, resulting in a byte string with values ranging from 0-255.
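
The 1:1 mapping can be checked with the standard library alone. The snippet below is a minimal sketch (variable names are illustrative) that also shows why a plain `.encode()`, which defaults to UTF-8 in Python 3, behaves differently for code points above U+007F:

```python
# Every byte value 0..255 maps to the code point with the same number, and back.
all_bytes = bytes(range(256))
as_text = all_bytes.decode('latin1')
round_trip = as_text.encode('latin1')
print(round_trip == all_bytes)              # True
print(ord(as_text[0]), ord(as_text[255]))   # 0 255

# UTF-8 needs two bytes for the same character that Latin-1 stores in one.
utf8_of_ff = '\xff'.encode('utf8')
latin1_of_ff = '\xff'.encode('latin1')
print(utf8_of_ff, latin1_of_ff)             # b'\xc3\xbf' b'\xff'
```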

edited Jun 22 '22 at 23:28

answered Jun 21 '22 at 00:34

Mark Tolonen


Hi! Thank you for your response. I found that when the bit position is 7 in all three cases, I still encounter a 'utf-8' decoding error. In such an exceptional scenario, do you think it is okay to use the 'replace' option in encoding(error =' ')? – Sid Jun 21 '22 at 08:27
@Sid The only reason that would occur is if the decoder didn't repair the string, and I only saw that occur when changing more than 3 bytes. I ran thousands of loops with the current code and specifically targeting bit 7. Did you run this *exact* code? What was the *encoded* vs. *decoded* string that failed? – Mark Tolonen Jun 21 '22 at 16:19
Yes, I did run the same code. My project initially uses pure binary strings. I am sending an image, and the color palette is represented by the values '00', '01', '11', and '10', which means one node must send binary data to the other. I used your code and adapted it accordingly for certain scenarios, including the erroneous scenario where the error can't be corrected (which I also need), so that I can show that I get a distorted image. In MATLAB it was easy, because there you can send a 1-D array of data that the RS encoder automatically treats as binary, but here it's confusing. – Sid Jun 21 '22 at 18:54
@Sid "I used your code and adapted accordingly" doesn't sound "exact". The code above does not error for me. Update with a [mcve] that reproduces your issue. The above works with "pure binary strings". That's what a bytes string (`b'xxx'`) is. – Mark Tolonen Jun 21 '22 at 21:16
Yes, you are absolutely correct that your solution works. But for me, there are scenarios where the error might be more than the correction capability. When that happens and when the bits are flipped at odd locations then I get an 'utf-8' decode error. Hopefully, I am a bit clearer this time. – Sid Jun 21 '22 at 21:28
@Sid You can increase the error handling by making the 1st parameter to RSCoder larger. (40,13) gave me 13 bytes of correction. If working with "pure binary data" why do you need UTF-8 encoding at all? That's just to put text into a binary form. You can use `errors='replace'` if you want to skip any UTF-8 decoding errors. – Mark Tolonen Jun 21 '22 at 21:57
Yes, of course I can increase the error correction capability, but I also need the 'non-correctable' scenarios to function properly. The package seems to work with bytes, and I tried changing the Galois field and primitive polynomial of unireedsolomon, but it was throwing a lot of errors. So I thought this was the only option I had. – Sid Jun 22 '22 at 07:05
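
For the uncorrectable case discussed in these comments, one possible approach is to decode leniently so that damaged bytes show up as U+FFFD instead of raising. This is only a sketch using the standard `errors='replace'` handler; `to_text` is a hypothetical helper name, and for pure binary data such as image pixels the UTF-8 step can be skipped entirely.

```python
def to_text(data: bytes) -> str:
    """Decode UTF-8, substituting U+FFFD for bytes that are not valid UTF-8."""
    return data.decode('utf8', errors='replace')

damaged = b'he\xecho'          # bytes left uncorrected by the decoder
print(to_text(damaged))        # 'he' + U+FFFD + 'ho'

try:
    damaged.decode('utf8')     # the default strict handler raises instead
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True
print(strict_failed)           # True
```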

score 0 · Answer 2 · answered Jun 17 '22 at 10:12

UTF-8 uses a variable-length encoding that ranges from 1 to 4 bytes. As you've already found, flipping random bits can result in invalid encodings. Take a look at

https://en.wikipedia.org/wiki/UTF-8#Encoding
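
As a small illustration of how a single flipped bit breaks UTF-8, the sketch below sets the top bit of an ASCII byte. The result looks like the lead byte of a three-byte sequence, but the bytes that follow are not continuation bytes. `flip_bit` is an illustrative helper, not a library function.

```python
def flip_bit(data: bytes, index: int, bit: int) -> bytes:
    """Return a copy of data with one bit of one byte inverted."""
    b = bytearray(data)
    b[index] ^= 1 << bit
    return bytes(b)

original = 'hello'.encode('utf8')
damaged = flip_bit(original, 2, 7)   # 0x6C ('l') becomes 0xEC
print(damaged)                       # b'he\xeclo'

try:
    damaged.decode('utf8')
    decoded_ok = True
except UnicodeDecodeError as exc:
    decoded_ok = False
    print(exc)                       # reports an invalid continuation byte
```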

Reed-Solomon normally uses fixed-size elements, in this case probably 8-bit elements, in a bit string. For longer messages, it could use 10-bit, 12-bit, or 16-bit elements. It would make more sense to convert the UTF-8 message into a bit string, zero-padded to an element boundary, and then perform Reed-Solomon encoding to append parity elements to the bit string. When reading, the bit string should be corrected (or an uncorrectable error detected) via Reed-Solomon before attempting to convert the bit string back to UTF-8.
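
A minimal sketch of the bit-string step, assuming 8-bit elements: each byte is written as exactly eight '0'/'1' characters, so leading zero bits survive the round trip (unlike `bin(int.from_bytes(...))`, which drops them). The function names are illustrative only.

```python
def bytes_to_bits(data: bytes) -> str:
    """Render each byte as exactly 8 bits, keeping leading zeros."""
    return ''.join(f'{byte:08b}' for byte in data)

def bits_to_bytes(bits: str) -> bytes:
    """Inverse of bytes_to_bits; the length must be a multiple of 8."""
    if len(bits) % 8:
        raise ValueError('bit string length must be a multiple of 8')
    return bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))

payload = b'\x00\x1b\xbf'
bits = bytes_to_bits(payload)
print(bits)                            # 000000000001101110111111
print(bits_to_bytes(bits) == payload)  # True
```

With this layout the bit string can be transmitted and corrupted freely, then turned back into bytes for Reed-Solomon decoding before any text decoding is attempted.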

answered Jun 17 '22 at 10:12

rcgldr


Thank you for your response, but I couldn't quite understand it. Let's say I am sending the message '101'; the RS encoder converts it to '101{ESC}¿Ñ{STS}' for a (7,3) code. Since I still need to send binary data, I tried converting the parities to bytes and then to bits to send them in binary form, and when I receive them, I must convert them back to these characters so that the decoder can read them properly. I hope I explained it properly. – Sid Jun 17 '22 at 12:41
Looking at [github unireedsolomon](https://github.com/lrq3000/unireedsolomon), the library can work with strings or binary data. The issue you're having is with UTF-8, since flipping random bits can result in invalid UTF-8 codes. I suggest converting string to binary bytes, then rs.encode with binary data. At this point you can flip any bits without issue. Then use rs.decode to fix the binary bytes, and then convert binary bytes to string. – rcgldr Jun 17 '22 at 17:02
Can you please give a short example ? It'd be much appreciated. – Sid Jun 18 '22 at 09:24
I'm not that experienced with Python. It may take a while for me to do this, but looking at that GitHub repository, it seems it would be better to convert the UTF-8 string to bytes before encoding, and to decode those bytes before converting back to a UTF-8 string. Again, the issue is that flipping random bits in a UTF-8 string will result in invalid codes. – rcgldr Jun 21 '22 at 00:09
