
I could use some advice: I'm parsing a binary file in PHP, specifically a Sega Genesis ROM file. According to the table I have made, certain bytes correspond to characters, while others control various features of the game's text engine.

Some bytes are used for characters, and others are "controller" bytes for line breaks, conditions, color and a number of other things, so a typical sentence will probably look like this:

FC 03 E7 05 D3 42 79 20 64 6F 69 6E 67 20 73 6F 2C BC BE 08 79 6F 75 20 6A 75 73 74 20 61 63 71 75 69 72 65 64 BC BE 04 61 20 74 65 73 74 61 6D 65 6E 74 20 74 6F 20 79 6F 75 72 BC 73 74 61 74 75 73 20 61 73 20 61 20 77 61 72 72 69 6F 72 21 BD BC

which I can translate to:

<FC><03><E7><05><D3>By doing so,<NL><BE><08>you just acquired<NL><BE><04>a testament to your<NL>status as a warrior!<CURSOR>

I want to specify properties, such as length, for each kind of controller-byte string, and write my own values to certain positions.

Bytes that translate into characters (00 to 7F) or line breaks (BC) consist of a single byte, while others consist of 2 (BE XX). Conditions (FC) even consist of 5 bytes: FC XX YY (where XX and YY are offsets which I need to calculate while I put my translated strings together).
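One way to express these lengths is a lookup table that drives the tokenizer, so that adding a new controller byte only means adding a table entry. The following is a minimal sketch, not a definitive implementation: `CONTROL_LENGTHS` and `tokenize()` are hypothetical names, and only the BC, BE and FC entries come from the description above. It works on raw bytes (converted from hex with `hex2bin`).

```php
<?php
// Hypothetical length table: opcode byte => total length in bytes (opcode included).
// Only BC, BE and FC are taken from the description above; extend as needed.
const CONTROL_LENGTHS = [
    0xBC => 1, // line break
    0xBE => 2, // BE XX
    0xFC => 5, // FC followed by 4 argument bytes
];

/**
 * Split a raw byte string into tokens of the form
 * ['offset' => int, 'op' => int, 'bytes' => string].
 * Bytes not listed in the table are treated as single-byte tokens.
 */
function tokenize(string $data): array
{
    $tokens = [];
    $pos = 0;
    $end = strlen($data);
    while ($pos < $end) {
        $op  = ord($data[$pos]);
        $len = CONTROL_LENGTHS[$op] ?? 1;
        $tokens[] = [
            'offset' => $pos,
            'op'     => $op,
            'bytes'  => substr($data, $pos, $len),
        ];
        $pos += $len;
    }
    return $tokens;
}

// Start of the sample sentence, shortened for illustration.
$data = hex2bin('FC03E705D34279BCBE0879');
foreach (tokenize($data) as $t) {
    printf("%04X  %s\n", $t['offset'], strtoupper(bin2hex($t['bytes'])));
}
```

Because the tokenizer always advances by the full token length, argument bytes that happen to equal BC, BE or FC are not mistaken for controller bytes.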

I want my parser to recognize such bytes and let me write XX YY dynamically. Using strtr I can only replace fixed "groups", e.g. when I put the static byte string into an array.

How would you do this while keeping the parser flexible? Thanks!

Alex
  • Does this [`FC( \w\w){4}|BE( \w\w)|(\w\w)`](https://regex101.com/r/kR9kdP/1) work? It includes the 3 rules you've mentioned, FC + 4 bytes or BE + 1 byte or just a single byte – degant May 03 '17 at 20:10
  • I'm not good at regex's but I used your expression with preg_match and it gave me an error: preg_match(): Delimiter must not be alphanumeric or backslash. – Alex May 03 '17 at 20:42
  • First check out the demo: https://regex101.com/r/kR9kdP/1 and check if this is what you are looking for and if the matches are working correctly. You can then try using it like this: https://regex101.com/r/kR9kdP/1/codegen?language=php – degant May 03 '17 at 20:44
  • @Alex In PHP you have to put a delimiter around the regular expression: `preg_match('/.../', $string, $matches)` where `...` is your regexp. – Barmar May 03 '17 at 20:50
  • @degant Your regexp is for a text file. He said this is a binary file, those are apparently the hex values of each byte. – Barmar May 03 '17 at 20:51
  • That is really helpful! thanks! Yes, it seems it's working as expected. I need to say there are no spaces between the hex-values, I added them for better readability. I pasted a bigger test-string and deleted the spaces in the expression. looking good! [link](https://regex101.com/r/kR9kdP/2) Now I need to figure out a concept for a function that returns maybe an array of all elements which become multidimensional when an expression is met.. – Alex May 03 '17 at 20:54
  • @Barmar That's no issue - I use the hex-values as string which works for me. – Alex May 03 '17 at 21:07
  • @Alex added an answer showing how to get an array of elements with the required hex characters. Let me know if it works. Thanks! – degant May 03 '17 at 21:22
  • @degant YES, it does, plus it keeps the indexes as found in the string which is pretty awesome for rebuilding the string. *T*H*A*N*K**Y*O*U*! – Alex May 03 '17 at 21:27
  • Yes, added bonus. Plus the indexing works across the three sub-arrays. Glad I could help! – degant May 03 '17 at 21:30

2 Answers


You can put hex characters in a regexp by using \x##, where ## is the hex code for the character. So you can match FC XX YY with:

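preg_match('/(?<=\xfc).{4}/s', $bytes, $match);

$match[0] will then contain the 4 bytes after FC. The lookbehind (?<=\xfc) keeps FC itself out of the match, and the s modifier lets . match newline bytes as well. You could split them up into pairs with capture groups:

preg_match('/(?<=\xfc)(..)(..)/s', $bytes, $match);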

$match[1] will contain XX and $match[2] will contain YY.
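To find every FC sequence rather than just the first, and to turn the captured bytes into numbers, something along these lines may work. It is only a sketch under stated assumptions: it assumes XX and YY are unsigned 16-bit big-endian values (not confirmed anywhere in this thread), and the names `first` and `second` are made up for illustration.

```php
<?php
// Start of the sample sentence from the question, as a raw binary string.
$bytes = hex2bin('FC03E705D34279BCBE0879');

// \xfc matches the byte 0xFC; the s modifier lets . match 0x0A as well.
preg_match_all(
    '/\xfc(..)(..)/s',
    $bytes,
    $matches,
    PREG_SET_ORDER | PREG_OFFSET_CAPTURE
);

foreach ($matches as $m) {
    // With PREG_OFFSET_CAPTURE every entry is [matched text, byte offset].
    $offset = $m[0][1];
    // ASSUMPTION: two unsigned 16-bit big-endian values (pack code "n").
    $args = unpack('nfirst/nsecond', $m[1][0] . $m[2][0]);
    printf("FC at %d: XX=0x%04X YY=0x%04X\n", $offset, $args['first'], $args['second']);
}
```

Note that a plain scan like this also matches an FC byte that is really an argument of another controller byte, so it is safer on data where that cannot happen.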

Barmar
  • Thanks! Could you explain the expression a bit more - the FC consists of five bytes in total, so the next 4 are the interesting ones. How can I tell the expression how many bytes to capture? – Alex May 03 '17 at 21:10
  • The number of bytes to capture is the number of `.` after `\xfc`. I thought `XX` and `YY` were each just one byte. – Barmar May 03 '17 at 21:13

Assuming you have your hex values available as a string, you can use this regex to parse it as you described. If you identify more rules besides FC**** or BE**, you can add them directly to the regex below so that they are extracted as well.

(?<fc>FC(\w\w){4})|(?<be>BE(\w\w))|(?<any>(\w\w))

The named groups fc, be and any make it easy to pick out each result set from the matches array, e.g. $matches['fc'].

Regex Demo: https://regex101.com/r/kR9kdP/5

$re = '/(?<fc>FC(\w\w){4})|(?<be>BE(\w\w))|(?<any>(\w\w))/';
$str = 'FC03E705D3FC0006042842616D20626162612062';

preg_match_all($re, $str, $matches, PREG_PATTERN_ORDER, 0);

// Print each group's matches (array_filter drops the empty entries)
print_r(array_filter($matches['fc']));  // Returns an array with all FC****
print_r(array_filter($matches['be']));  // Returns an array with all BE**
print_r(array_filter($matches['any'])); // Returns rest **

PHP Demo: http://ideone.com/qWUaob

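Sample results (taken from a longer input than the $str above, which is why the BE indexes are so high; the any array is omitted):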

Array
(
    [0] => FC03E705D3
    [1] => FC00060428
)
Array
(
    [50] => BE08
    [59] => BE04
    [113] => BE08
    [132] => BE04
)
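To write XX and YY dynamically while rebuilding the string, the matches can be walked in order and each FC token re-emitted with new values. This is a rough sketch rather than a finished solution: `hypotheticalOffsets()` is a placeholder for whatever offset calculation the game needs, and writing XX and YY as 16-bit values (4 hex digits each) is an assumption.

```php
<?php
$re  = '/(?<fc>FC[0-9A-F]{8})|(?<be>BE[0-9A-F]{2})|(?<any>[0-9A-F]{2})/';
$str = 'FC03E705D34279BCBE0879';

// Hypothetical placeholder: returns the new [XX, YY] for an FC token that
// starts at byte position $bytePos of the rebuilt string.
function hypotheticalOffsets(int $bytePos): array
{
    return [$bytePos + 0x10, $bytePos + 0x20];
}

// PREG_SET_ORDER yields one entry per token, in input order.
preg_match_all($re, $str, $tokens, PREG_SET_ORDER);

$out = '';
foreach ($tokens as $t) {
    if (!empty($t['fc'])) {
        // Two hex digits per byte, so the byte position is half the length.
        [$xx, $yy] = hypotheticalOffsets(intdiv(strlen($out), 2));
        // ASSUMPTION: XX and YY are 16-bit values; mask to 4 hex digits each.
        $out .= sprintf('FC%04X%04X', $xx & 0xFFFF, $yy & 0xFFFF);
    } else {
        // BE** and plain bytes are copied through unchanged.
        $out .= $t[0];
    }
}

echo $out, "\n";
```

Using explicit hex classes instead of \w keeps stray non-hex characters from being swallowed as bytes; the pattern still expects upper-case hex, as in the original.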

Hope this helps!

degant