
How can I parse a sequence of binary digits in Python? The following is an example of what I am trying to do.

I have a sequence of binary digits, for example

sequence =  '1110110100110111011011110101100101100'

and I need to parse this and extract the data.

Say the above sequence contains start, id, data and end fields.

start is a 2-bit field, id is an 8-bit field, the data field can vary from 1 to 8192 bits, and end is a 4-bit field.

After parsing, I'm expecting the output as follows:

result = {start : 11,
          id : 10110100,
          data : 11011101101111010110010,
          end : 1100,
         }

I'm using this in one of my applications. I'm able to parse the sequence using a regex, but the problem is that the regex must be written by the user. As an alternative I'm using a BNF grammar, since grammars are more readable.
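
For illustration only, a regex of roughly this shape (not necessarily the one a user would write) handles the single-packet example; the lazy quantifier on data together with the end-of-string anchor leaves the last four bits for the end field:

import re

pattern = re.compile(r'(?P<start>[01]{2})(?P<id>[01]{8})'
                     r'(?P<data>[01]{1,8192}?)(?P<end>[01]{4})$')
print(pattern.match(sequence).groupdict())
# {'start': '11', 'id': '10110100', 'data': '11011101101111010110010', 'end': '1100'}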

I tried solving this using Python's parsimonious and pyparsing parsers, but I am not able to find a solution for the fields with variable length.

The grammar I wrote for Python's parsimonious parser is as follows:

grammar = """sequence = start id data end

start = ~"[01]{2}"
id = ~"[01]{8}"
data = ~"[01]{1,8192}"
end = ~"[01]{4}"
"""

Since the data field is of variable length and the parser is greedy, the sequence above does not match this grammar: the parser pulls the end field's bits into the data field.
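
For reference, this is roughly how the failure shows up when the grammar is fed to parsimonious (a sketch; the exact exception text may differ):

from parsimonious.grammar import Grammar
from parsimonious.exceptions import ParseError

try:
    Grammar(grammar).parse(sequence)
except ParseError as err:
    # the greedy data rule has already consumed the final four bits,
    # so the end rule has nothing left to match
    print(err)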

The example above is a simplified version of my problem.

Let me describe the full problem. There are 3 kinds of packets (let's call them Token, Handshake and Data packets). Token and Handshake packets have a fixed length, and the Data packet has a variable length. (The example shown above is a data packet.)

The input consists of a continuous stream of bits. Each packet's beginning is marked by the "start" pattern and its end by the "end" pattern. Both of these are fixed bit patterns.

Example Token packet grammar:

start - 2 bits, id - 8 bits, address - 7 bits, end - 4 bits
111011010011011101100

Example Handshake packet grammar:

start - 2 bits, id - 8 bits, end - 4 bits
11101101001100

Example top level rule:

packet = tokenpacket | datapacket | handshakepacket

If there were only one type of packet, then slicing would work. But when we start parsing, we do not know which packet we will finally end up matching. This is why I thought of using a grammar, as the problem is very similar to language parsing.

Can we make the slicing approach work in this case, where we have three different packet types to be parsed?

What's the best way to solve this problem?

Thanks in advance,

user2109788

3 Answers


This will do; just use slicing for this job:

def binParser(data):
    # Slice the fixed-width fields off both ends; whatever is left in the
    # middle is the variable-length data field.
    result = {}
    result["start"] = data[:2]    # first 2 bits
    result["id"] = data[2:10]     # next 8 bits
    result["end"] = data[-4:]     # last 4 bits
    result["data"] = data[10:-4]  # everything in between
    return result

You will get the correct data from the string.
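
For example, with the sequence from the question:

sequence = '1110110100110111011011110101100101100'
print(binParser(sequence))
# {'start': '11', 'id': '10110100', 'end': '1100', 'data': '11011101101111010110010'}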

Netwave
  • Thanks for the reply. But sorry, I did not mention in the question that even the fields are not fixed. The user will enter the field names with the lengths of the fields as a grammar. Also, more than one field can have variable length. – user2109788 Feb 18 '14 at 12:11
  • 1
    @user2109788: more variable length fields means the grammar is ambiguous. extend your question. – Karoly Horvath Feb 18 '14 at 12:12
  • I have extended the question. In my problem there will be only one variable-length field. I was thinking in general, so I mentioned in the above comment that there can be more variable-length fields. So now, can you help me find the answer to the question? – user2109788 Feb 19 '14 at 10:42
  • @user2109788 I've read it, but unless you know or have the data for parsing it, you can't parse it; you may have a header or something where it explains what type of packet you must read. – Netwave Feb 19 '14 at 13:51

Presumably, there will only ever be one variable-length field, so you can allow this by defining a distance from the start of the sequence and a distance from the end, e.g.

rules = {'start': (None, 2), 'id': (2, 10), 
         'data': (10, -4), 'end': (-4, None)}

and then use slicing:

sequence =  '1110110100110111011011110101100101100'

result = dict((k, sequence[v[0]:v[1]]) for k, v in rules.items())

This gives:

result == {'id': '10110100', 
           'end': '1100', 
           'data': '11011101101111010110010', 
           'start': '11'}
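
To cover the three packet types from the updated question, the same idea can be extended by keeping one rules dict per packet type. This is only a sketch, and it assumes the packet type has already been determined somehow (for example from the id field or from the packet length):

packet_rules = {
    'token':     {'start': (None, 2), 'id': (2, 10), 'address': (10, 17), 'end': (17, 21)},
    'handshake': {'start': (None, 2), 'id': (2, 10), 'end': (10, 14)},
    'data':      {'start': (None, 2), 'id': (2, 10), 'data': (10, -4), 'end': (-4, None)},
}

def parse_packet(bits, packet_type):
    # packet_type must be decided before slicing, e.g. from the id value
    return {k: bits[v[0]:v[1]] for k, v in packet_rules[packet_type].items()}

print(parse_packet('111011010011011101100', 'token'))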
jonrsharpe

Since you mentioned pyparsing in the tags, here is how I would go about it using pyparsing. This uses Daniel Sanchez's binParser for post-processing.

from pyparsing import Word

# Post-processing of the matched bit string into the field dictionary.
def binParser(m):
    data = m[0]
    return {'start': data[:2],
            'id': data[2:10],
            'end': data[-4:],
            'data': data[10:-4]}

# At least 14 characters for the required fields; attach the post-processor.
bin_sequence = Word('01', min=14).setParseAction(binParser)


sequence = '1110110100110111011011110101100101100'
print(bin_sequence.parseString(sequence)[0])

This could then be used as part of a larger parser.
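
As a sketch of such a larger parser (an assumption about how the pieces might fit, not a complete solution), the three packet types from the question could be expressed as alternatives; in a real bit stream the start/end markers or the id field would still be needed to disambiguate packets of the same length:

from pyparsing import Word

token_packet = Word('01', exact=21)      # 2 + 8 + 7 + 4 bits
handshake_packet = Word('01', exact=14)  # 2 + 8 + 4 bits
data_packet = Word('01', min=15)         # 2 + 8 + (1..8192) + 4 bits

packet = token_packet ^ handshake_packet ^ data_packet  # '^' keeps the longest match

print(packet.parseString('11101101001100', parseAll=True))  # the handshake example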

Eldritch Cheese