I am having an issue which I am unable to solve after spending the last 10 hours searching around the internet for an answer.

I have some data in this format

??E?�?0?�?<?20120529184453+0200?20120529184453+0200?�?�?G0E?5?=20111213T103134000-136.225.6.103-30365316-1448169323, ver: 12?�?W??tP?2?�?
??|?????
??:o?????tP?�??B@?????B@????�?�?)0?�???
49471010550?�??	???tP???3??<?�?�?�?�??�?�?�?�?�??�?�?�?�?�

I have some PHP code, not written by me, which just runs html_entity_decode on that data and returns the correct results.

When I try running Perl's decode_entities I get a completely different result. After some debugging, it seems that PHP is "properly" replacing what appear to be invalid entities, such as &#0; or &#8;, with their ASCII counterparts, namely NULL and backspace for the two cases mentioned.

Perl, on the other hand, does not seem to decode those "invalid" entities and leaves them alone, which later on screws up the result (which goes through unpack or, in PHP's case, bin2hex; this fails because, rather than unpacking NULL to 00, it unpacks each individual character of &#0;).

I have tried everything I can think of, including running the following substitution in Perl after running decode_entities:

    $var =~ s/&#(\d+);/chr($1)/g

However, that does not work at all.

This is driving me mad, and I would like to have this done in Perl rather than PHP. I really hope I don't have to write 1000 pattern-matching lines in Perl to cover all possible entities and numbers.

Does anybody have an idea how to go about this problem without resorting to porting PHP's entire html_entity_decode function to Perl or writing endless lines of pattern matching?

Alexandre Thenorio

1 Answer

You're almost there. Instead of

    $var =~ s/&#(\d+);/chr($1)/g

say

    $var =~ s/&#(\d+);/chr($1)/ge

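The /e modifier instructs Perl to 'e'valuate the replacement as Perl code. Without it, the replacement is treated as an interpolated string, so &#0; becomes the literal text chr(0) instead of a NUL character.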
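As a minimal sketch of how this could be wrapped up, the helper below decodes numeric entities in a single pass and then hex-dumps the result with unpack, as the question describes. The name decode_numeric_entities and the sample input are made up for illustration, and handling hex entities (&#x..;) is an extra assumption that the question does not mention.

```perl
use strict;
use warnings;

# Hypothetical helper: decode decimal (&#NN;) and hex (&#xHH;) numeric
# entities, including control characters such as NUL and backspace.
sub decode_numeric_entities {
    my ($text) = @_;
    # /e evaluates the replacement as Perl code, so chr() is really called.
    # A single pass avoids decoding the output of an earlier substitution.
    $text =~ s/&#(?:[xX]([0-9a-fA-F]+)|([0-9]+));/chr(defined($1) ? hex($1) : $2)/ge;
    return $text;
}

# Made-up sample input, not the data from the question.
my $decoded = decode_numeric_entities('A&#0;B&#8;C&#x41;');
print unpack('H*', $decoded), "\n";   # prints 410042084341
```

Named entities such as &amp; are left untouched here, so this would normally run alongside decode_entities rather than replace it.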

mob