Regex / Python3 - re.findall() - Find all occurrences between opcodes

Question

Background

I'm reverse engineering a TCP stream that uses a Type-Length-Value approach to encoding data.

Example:

TCP Payload: b'0000001f001270622e416374696f6e4e6f74696679425243080310840718880e20901c'
---------------------------------------------------------------------------------------
Type:     00 00   # New function call
Length:   00 1f   # Length of Value (Length of Function + Function + Data)
Value:    00 12   # Length of Function
Value:    70 62 2e 41 63 74 69 6f 6e 4e 6f 74 69 66 79 42 52 43   # Function ->(hex2ascii)-> pb.ActionNotifyBRC
Value:    08 03 10 84 07 18 88 0e 20 90 1c   # Data
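
The frame layout above can be sketched in code. This is a minimal sketch, assuming Type and both Length fields are 2-byte big-endian unsigned integers (which is consistent with the example, but not confirmed beyond it); the name parse_frame is hypothetical.

```python
import struct

def parse_frame(payload: bytes):
    """Split one frame into (type, function name, data)."""
    # Assumption: Type and Length are 2-byte big-endian unsigned integers.
    frame_type, length = struct.unpack_from(">HH", payload, 0)
    value = payload[4:4 + length]
    if len(value) != length:
        raise ValueError("truncated frame")
    # The Value begins with the 2-byte length of the function name.
    (name_len,) = struct.unpack_from(">H", value, 0)
    name = value[2:2 + name_len].decode("ascii")
    # Whatever follows the function name is the Data object.
    data = value[2 + name_len:]
    return frame_type, name, data

payload = bytes.fromhex(
    "0000001f001270622e416374696f6e4e6f74696679425243"
    "080310840718880e20901c"
)
frame_type, name, data = parse_frame(payload)
print(frame_type, name, data.hex())
```

For the example payload this separates the function name pb.ActionNotifyBRC from the 11 Data bytes that the rest of the question is about.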

However, the Data is an object that can include multiple variables, each with a variable-length Value.

Data: 08 05 10 04 10 64 18 c8 01 20 ef 0f
----------------------------------------------
Opcode : Value
  08   :  05          # var1 : 1 byte
  10   :  04          # var2 : 1 byte
  18   :  c8 01       # var3 : 1-10 bytes
  20   :  ef 0f       # var4 : 1-10 bytes

Currently I am parsing the Data using the following Python3 code:

############################### NOTES ###############################
# Opcodes sometimes rotate starting positions but the general order is always held:
#     Data:     20 ef 0f 08 05 10 04 10 64 18 c8 01
#####################################################################

import re
import binascii

def dataVariable(data, start, end):
    p = re.compile(start + b'(.*?)' + end)
    return p.findall(data + data)

data = bytearray.fromhex('08051004106418c80120ef0f')
var3 = dataVariable(data, b'\x18', b'\x20')
print("Variable 3:", end=' ')
for item in set(var3):
    print(binascii.hexlify(item), end=' ')

----------------------------------------------------------------------------
[Output]: Variable 3: b'c801'

So far all good...

Problem

If an Opcode byte appears inside a preceding variable's Value, the code is no longer reliable.

Data: 08 05 10 04 10 64 18 c8 20 01 20 ef 0f
----------------------------------------------
Opcode : Value
  08   :  05          
  10   :  04          
  18   :  c8 20 01        # The Value includes the next opcode (20)  
  20   :  ef 0f
----------------------------------------------------------------------------
[Output]: Variable 3: b'c8'
[Output]: Variable 4: b'0120ef0f'

I was expecting an output of:

[Output]: Variable 3: b'c8' b'c82001'
[Output]: Variable 4: b'0120ef0f' b'ef0f'

It seems like there is an issue with my regular expression?

Update

To further clarify, var3 and var4 represent integers. I have managed to figure out how the length of the Value is encoded. The most significant bit (MSB) of each byte is a flag that tells me another byte is coming. You can then strip the MSB of each byte, swap the endianness (the least significant 7-bit group comes first), join the remaining bits and convert to decimal.

  data   ->   binary representation    -> strip MSB and swap endianness -> decimal representation

ac d7 05 -> 10101100 11010111 00000101 ->   0001 01101011 10101100      ->   93100
e4 a6 04 -> 11100100 10100110 00000100 ->   0001 00010011 01100100      ->   70500
90 e1 02 -> 10010000 11100001 00000010 ->        10110000 10010000      ->   45200
dc 24    ->          11011100 00100100 ->        00010010 01011100      ->   4700
f0 60    ->          11110000 01100000 ->        00110000 01110000      ->   12400
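
The decoding steps in the table can be written as a short function. This is a sketch of the scheme as described above, not code from the original post; decode_varint is a hypothetical name.

```python
def decode_varint(raw: bytes) -> int:
    """Decode one integer whose bytes use the MSB as a 'more bytes follow' flag."""
    value = 0
    for i, byte in enumerate(raw):
        # Keep the low 7 bits; the first byte holds the least significant group.
        value |= (byte & 0x7F) << (7 * i)
        if byte & 0x80 == 0:
            # MSB clear: this was the last byte of the integer.
            return value
    raise ValueError("truncated value: last byte still has its MSB set")

for text in ("ac d7 05", "e4 a6 04", "90 e1 02", "dc 24", "f0 60"):
    print(text, "->", decode_varint(bytes.fromhex(text)))
```

Run on the five rows of the table, it gives 93100, 70500, 45200, 4700 and 12400.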

Can these values be overlapping? Like `18 56 18 20 57 20` with the result as `[b'56182057', b'5618', b'2057']`? — Wiktor Stribiżew, Sep 06 '18 at 08:47

score 0 · Accepted Answer · answered Sep 06 '18 at 11:44

You may use

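import re

def dataVariable(data, start, end):
    # Lookahead + capture group: findall() reports a match at every start byte,
    # each running greedily up to the last end byte.
    # re.escape() keeps bytes such as 0x2e literal; re.DOTALL lets "." match 0x0a too.
    p = re.compile(b'(?=(' + re.escape(start) + b'.*' + re.escape(end) + b'))', re.DOTALL)
    res = []
    for x in p.findall(data):
        cur = b''
        # Walk the match byte by byte, skipping the leading start byte.
        for i in range(1, len(x)):
            m = x[i:i+1]
            # Every end byte closes one candidate value (unless nothing was collected yet).
            if m == end and cur:
                res.append(cur)
            cur = cur + m
    return res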

See the Python demo:

data = bytearray.fromhex('08051004106418c8200120ef0f0f') # => b'c82001' b'c8'
#data = bytearray.fromhex('185618205720') # => b'56182057' b'2057' b'5618' 
var3 = dataVariable(data, b'\x18', b'\x20')
print("Variable 3:", end=' ')
for item in set(var3):
    print(binascii.hexlify(item), end=' ')

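The output is Variable 3: b'c8' b'c82001' for the '08051004106418c8200120ef0f0f' input and Variable 3: b'56182057' b'2057' b'5618' for the '185618205720' input. The order of the items may vary between runs because they are printed from a set.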

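The pattern is of the (?=(...)) type: a capturing group inside a positive lookahead. Since a lookahead consumes no input, the regex engine retries at every position and so finds all overlapping matches. If you do not need the overlapping feature, remove the (?=( prefix and the matching )) suffix from the regex.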
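
To see what the lookahead changes, here is a small comparison on the 185618205720 input. It only illustrates the matching stage, not the inner splitting that dataVariable performs afterwards. The re.escape() calls and the re.DOTALL flag are additions made for this sketch, so that arbitrary byte values are handled literally.

```python
import re

data = bytes.fromhex("185618205720")
start, end = re.escape(b"\x18"), re.escape(b"\x20")

# Ordinary pattern: a match consumes its bytes, so the second 0x18,
# which lies inside the first match, is never tried as a starting point.
plain = re.compile(start + b"(.*)" + end, re.DOTALL)

# Lookahead pattern: each match is zero-width, so every 0x18 gets its turn.
overlapping = re.compile(b"(?=" + start + b"(.*)" + end + b")", re.DOTALL)

plain_hits = [m.hex() for m in plain.findall(data)]
overlapping_hits = [m.hex() for m in overlapping.findall(data)]
print(plain_hits)        # ['56182057']
print(overlapping_hits)  # ['56182057', '2057']
```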

The point here is:

Match all substrings that start with start and run up to the last end, using the start + b'.*' + end pattern.
Iterate through each match, dropping the leading start byte, and append an item to the resulting list whenever an end byte is found, accumulating the bytes seen so far at each iteration (thus getting all inner substrings inside the match).

This works perfectly! Thanks heaps! I need to read up more on regex, I haven't had much experience. On my drive home today I figured I was disregarding something major. To further clarify, var3 and var4 are representing integers. I already knew how to parse this -> remove the most significant bit and swap the endianness, it works for two's complement as well. Anyway, that MSB I was discarding was actually a flag letting me know another byte was coming. That's how they were encoding the length. — gop12, Sep 06 '18 at 12:58
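
Given the continuation flag described in the comment above, the Data can also be walked from left to right without a regex, because each value announces its own end. The following is a minimal sketch, assuming every opcode is a single byte followed by exactly one such integer (the thread only confirms this encoding for var3 and var4); read_varint and parse_fields are hypothetical names, and the second input is an invented example of a value that ends in 0x20.

```python
def read_varint(buf, pos):
    """Return (value, next_pos) for the integer that starts at buf[pos]."""
    value = 0
    shift = 0
    while True:
        if pos >= len(buf):
            raise ValueError("truncated integer")
        byte = buf[pos]
        pos += 1
        value |= (byte & 0x7F) << shift   # low 7 bits, least significant group first
        if not byte & 0x80:               # MSB clear: last byte of this integer
            return value, pos
        shift += 7

def parse_fields(data):
    """Walk the Data left to right and return a list of (opcode, value)."""
    fields = []
    pos = 0
    while pos < len(data):
        opcode = data[pos]                # assumption: opcodes are one byte long
        value, pos = read_varint(data, pos + 1)
        fields.append((opcode, value))
    return fields

for text in ("08051004106418c80120ef0f", "18c82020ef0f"):
    print([(hex(op), val) for op, val in parse_fields(bytes.fromhex(text))])
```

In the second input the byte 0x20 appears once as the last byte of var3 and once as the opcode of var4, and the walker keeps them apart because it never searches for opcodes inside a value. It does need to start on an opcode, so a rotated Data block is fine as long as the rotation happens on a field boundary.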
