I have raw hexdump data. How to match variable length with many different delimiters

Question

Data

$dat[1] = "\x08\xB3\xE3\x0C\x09\x07\x4D\x6F\x68\x61\x6D\x65\x64\x1A";
$dat[2] = "\x08\x84\x03\x09\x03\x53\x6F\x6C\x6C\x1A";
$dat[3] = "\x08\xD4\xEA\x0E\x09\x03\x54\x6F\x6C\x1A";
$dat[4] = "\x08\xD5\x09\x03\x55\x6F\x6C\x1A";
$dat[5] = "\x08\xD4\xEA\x09\x09\x03\x54\x6F\x6C\x1A";
$dat[6] = "\x08\xD4\xEA\xOE\x09\x09\x54\x6F\x6C\x61\x6D\x65\x64\x61\x61\x1A";
$dat[7] = "\x08\xD4\xEA\x09\x09\x09\x54\x6F\x6C\x61\x6D\x65\x64\x61\x61\x1A";

I have raw hexdump data in the pattern above. 08, 09 and 1A are the delimiters. The problem is that columns D and F can also be 09. Is it possible to match this with a regex? I need the data between those delimiters.

My code is not accurate:

m/\x08(.+?\x09?)\x09(.+?)\x1A/s;

What is your input format? Certainly it's not a picture, as you can't use regex on text in images. Is it an Excel file? Or a CSV file? What is the delimiter for the columns you are showing us? And does it literally say `12` and `1A` in the file? `\x12` is a control character, not the two characters `1` and `2`. Please provide a [mcve] and [edit] your question to tell us what your real input is. If it contains control characters, you can use Data::Dumper to stringify it and show us that. — simbabque, Jan 10 '19 at 10:32
Yes, Column F is the number of bytes. Please refresh again. I just edited. — ต้อง เอกมัย, Jan 10 '19 at 11:05

ikegami · Answer 1 · 2019-01-10T12:31:03.953

I'm assuming the record format is defined as follows:

Each record is made up of fields that start with a type (e.g. 08, 09, 1A).
Field type 1A is a special type that signals the end of the record.
All records have a field of type 1A.
Field type 08 is followed by a number encoded using this format.
Field type 09 is followed by a single byte that defines the number of bytes in the remainder of the field, which appears to be an ASCII-encoded string. (Another reasonable assumption is that the field type 09 is followed by a single byte that defines the number of Code Points encoded using UTF-8 that follow).
A record may not have two fields of the same type.

I made no assumptions about the following:

If a field of type 08 must be present or not.
If a field of type 09 must be present or not.
The order of the fields.

To parse such records, you can use the following:

for ($file) {  # Makes $_ an alias for $file.
   REC: while (1) {
      my %rec;
      FIELD: while (1) {
         my $field_start = pos() || 0;
         if (!/\G ( . )/sxgc) {
            last REC if !%rec;
            die("Premature EOF\n");
         }

         my $type = $1;  # The field type byte just matched.

         if ($type eq "\x1A") {
            last;  # End of record.
         }

         elsif ($type eq "\x08") {
            !exists($rec{"08"})
               or warn(sprintf("Duplicate field of type %02X at pos %s\n", ord($type), $field_start));

            # Any number of continuation bytes (high bit set), then one final byte.
            /\G ( [\x80-\xFF]*[\x00-\x7F] ) /sxgc
               or die(sprintf("Bad field of type %02X at pos %s\n", ord($type), $field_start));

            $rec{"08"} = unpack("w", "$1");
         }

         elsif ($type eq "\x09") {
            !exists($rec{"09"})
               or warn(sprintf("Duplicate field of type %02X at pos %s\n", ord($type), $field_start));

            # A single length byte, followed by that many bytes of string data.
            /\G ( . ) /sxgc
               or die(sprintf("Bad field of type %02X at pos %s\n", ord($type), $field_start));

            my $len = ord($1);
            length() >= pos() + $len
               or die(sprintf("Bad field of type %02X at pos %s\n", ord($type), $field_start));

            $rec{"09"} = substr($_, pos(), $len);
            pos() += $len;
         }

         else {
            die(sprintf("Unrecognized field type %02X at pos %s\n", ord($type), $field_start));
         }
      }

      # Do something with %rec
   }
}
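
The loop above expects $file to already hold the raw bytes. As a minimal sketch of one way to get them there, assuming the data lives in a file on disk, the hypothetical helper slurp_raw below (the name and the idea of a file path are illustrative assumptions, not part of the answer) reads a file without any encoding or newline translation:

```perl
use strict;
use warnings;

# Hypothetical helper (not from the answer): read a whole file as raw bytes.
sub slurp_raw {
    my ($path) = @_;
    # ':raw' disables CRLF translation and encoding layers, so bytes such as
    # 0x0D, 0x0A and 0x1A reach the parser unchanged.
    open(my $fh, '<:raw', $path) or die("Can't open $path: $!\n");
    local $/;            # Slurp mode: read everything in one go.
    my $data = <$fh>;
    close($fh);
    return $data;
}
```

Something like `my $file = slurp_raw($path);` before the loop would then be enough. Reading in raw mode matters here because 0x1A and 0x0D/0x0A can be treated specially by text-mode reads on some platforms.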

Yes, very good assumptions. I hurriedly edited the question several times. — ต้อง เอกมัย, Jan 10 '19 at 11:09
It might be a fun idea to use `unpack 'xwxC/ax'`, but I'm too lazy to check that/work that out. — Corion, Jan 10 '19 at 11:37
To my surprise, `unpack` actually works, but if the input data is as unreliable as the OP shows (and not just typing errors, as some of the stuff suggests), then a regex based approach is much saner. — Corion, Jan 10 '19 at 12:24
@Corion, oh, I didn't realize the number format was standard and supported by `pack`! You've made some additional assumptions (about the order of fields and number of fields of each type), though. — ikegami, Jan 10 '19 at 12:24
Yeah, error checking/reporting is invariably required at some point, so `unpack` might be "good for now", but it's probably just a temporary solution. A proper parser such as the one I posted will eventually be needed. It doesn't necessarily have to be regex-based, though. — ikegami, Jan 10 '19 at 12:29
I've now added dynamic parsing and error checking as well. The input data is either really badly typed from Excel, or it does not conform to the OP's specification at all. — Corion, Jan 10 '19 at 12:54
Sorry, the field 08 data are just examples, since I was in a hurry to edit the question. Some of them are not valid. — ต้อง เอกมัย, Jan 10 '19 at 16:13

Corion · Accepted Answer · 2019-01-10T12:53:44.450

It seems that $dat[4] is invalid data. At least the first field should contain a second byte because D5 indicates that there is at least one more byte following.

$dat[2] is also invalid data because the length field for 0x09 is 0x03, but the field itself contains four characters.

$dat[6] contains an invalid hex escape. Instead of \xOE, I use \x0E.

With $dat[2] and $dat[4] left out and that escape corrected, you can parse your input messages using the unpack function:

my( $number, $name ) = unpack 'xwxC/ax', $d;

The template for unpack means:

x - throw away this byte (0x08)

w - read a BER-encoded number

x - throw away this byte (0x09)

C - read this byte and use it as the length for the following string

a - read the next bytes and use them as string characters

x - throw away this byte (0x1A)
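
To make the `w` step less opaque, here is a small sketch that decodes a BER compressed integer by hand and compares it with `unpack 'w'`. The helper name decode_ber is made up for this illustration. Each byte carries 7 bits of the number, and a set high bit means another byte follows, which is why a lone D5 cannot be a complete number.

```perl
use strict;
use warnings;

# Illustrative helper (hypothetical name): decode a BER compressed integer.
sub decode_ber {
    my ($bytes) = @_;
    my $n = 0;
    for my $byte (unpack 'C*', $bytes) {
        $n = ($n << 7) | ($byte & 0x7F);   # Append the 7 payload bits.
        last unless $byte & 0x80;          # High bit clear: this was the last byte.
    }
    return $n;
}

my $raw = "\xB3\xE3\x0C";   # The 0x08 field of $dat[1]
printf "%d %d\n", decode_ber($raw), unpack('w', $raw);   # prints "848268 848268"
```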

If you want to keep the field numbers as well, use

    unpack 'CwCC/aC', $d

At least for the data as shown the unpack template works, with the assumptions I've stated. If this is actual ASN.1 data then there should be far more validation etc., and if the field separators might be missing, a regexp-based approach as shown by @ikegami is certainly more robust.
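
As a rough sketch of what minimal validation could look like with the fixed template, the hypothetical helper parse_fixed below (not part of the original answer) keeps the type bytes and rejects a message unless they are 0x08, 0x09 and 0x1A in that order. It does not check for trailing bytes, so treat it as a starting point only.

```perl
use strict;
use warnings;

# Hypothetical helper: returns (number, name), or an empty list if the
# message does not have the expected 08 / 09 / 1A layout.
sub parse_fixed {
    my ($d) = @_;
    # eval guards against unpack dying on an unterminated BER number.
    my @f = eval { unpack 'CwCC/aC', $d };
    return unless @f == 5;                 # Too short: a field is missing.
    my ($t08, $number, $t09, $name, $t1a) = @f;
    return unless $t08 == 0x08 && $t09 == 0x09 && $t1a == 0x1A;
    return ($number, $name);
}

for my $d (
    "\x08\xB3\xE3\x0C\x09\x07\x4D\x6F\x68\x61\x6D\x65\x64\x1A",   # $dat[1]
    "\x08\x84\x03\x09\x03\x53\x6F\x6C\x6C\x1A",                   # $dat[2]
) {
    my @r = parse_fixed($d);
    print @r ? "ok: @r\n" : "invalid message\n";
}
```

With the sample data from the question, $dat[2] fails because the byte after the three-character string is not 0x1A, and $dat[4] fails because the byte where the 0x09 type is expected is 0x03.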

Fixed/dynamic field order

The template relies on a fixed order of the fields. If the field order is not certain to be fixed, you will need to determine the unpack template(s) based on the type of each field in a loop. This brings the unpack approach close to the approach by ikegami.

(my ($message_type), $d) = unpack 'aa*', $d;
if( $message_type eq "\x08" ) {
    (my ($number), $d) = unpack 'wa*', $d;
    print "Field 0x08: $number\n";
} # elsif ... (one branch per remaining field type)

See the following complete program, which contains both the hardcoded (fixed field order) parser and the dynamic parser:

#!perl
use strict;
use warnings;

my @dat;

$dat[1] = "\x08\xB3\xE3\x0C\x09\x07\x4D\x6F\x68\x61\x6D\x65\x64\x1A";
#$dat[2] = "\x08\x84\x03\x09\x03\x53\x6F\x6C\x6C\x1A";
$dat[3] = "\x08\xD4\xEA\x0E\x09\x03\x54\x6F\x6C\x1A";
#$dat[4] = "\x08\xD5\x09\x03\x55\x6F\x6C\x1A";
$dat[5] = "\x08\xD4\xEA\x09\x09\x03\x54\x6F\x6C\x1A";
$dat[6] = "\x08\xD4\xEA\x0E\x09\x09\x54\x6F\x6C\x61\x6D\x65\x64\x61\x61\x1A";
$dat[7] = "\x08\xD4\xEA\x09\x09\x09\x54\x6F\x6C\x61\x6D\x65\x64\x61\x61\x1A";

@dat = grep {defined } @dat;

use Data::Dumper;

for my $d (@dat) {
    # Hardcoded message parser
    print Dumper [
        unpack 'CwCC/aC', $d
    ];

    # Dynamic message parser
    while( length $d ) {
        (my ($message_type), $d) = unpack 'aa*', $d;
        if( $message_type eq "\x08" ) {
            (my ($number), $d) = unpack 'wa*', $d;
            print "Field 0x08: $number\n";
        } elsif ( $message_type eq "\x09" ) {
            (my ($len)) = unpack 'C', $d;
            (my ($name), $d) = unpack 'C/aa*', $d;
            print "Field 0x09: $name\n";
        } elsif ( $message_type eq "\x1A" ) {
            # finished
            print "Field 0x1A\n";
        } else {
            die sprintf "Unknown message type %08x", ord($message_type);
        };
    };
};

Output

$VAR1 = [
          8,
          848268,
          9,
          'Mohamed',
          26
        ];
Field 0x08: 848268
Field 0x09: Mohamed
Field 0x1A
$VAR1 = [
          8,
          1389838,
          9,
          'Tol',
          26
        ];
Field 0x08: 1389838
Field 0x09: Tol
Field 0x1A
$VAR1 = [
          8,
          1389833,
          9,
          'Tol',
          26
        ];
Field 0x08: 1389833
Field 0x09: Tol
Field 0x1A
$VAR1 = [
          8,
          1389838,
          9,
          'Tolamedaa',
          26
        ];
Field 0x08: 1389838
Field 0x09: Tolamedaa
Field 0x1A
$VAR1 = [
          8,
          1389833,
          9,
          'Tolamedaa',
          26
        ];
Field 0x08: 1389833
Field 0x09: Tolamedaa
Field 0x1A
