I'm writing a piece of code that checks the mp3 files on my server and reports whether any of them contain a false sync. In short, I load each file in PHP using the fread() function and keep the stream in a variable. After splitting that stream into separate streams for ID3v1 (not relevant here, since it is not subject to sync), ID3v2 (the main problem) and audio, I have to apply the unsynchronisation scheme to the ID3v2 stream.

According to the official ID3v2 documentation:

The only purpose of the 'unsynchronisation scheme' is to make the ID3v2 tag as compatible as possible with existing software. There is no use in 'unsynchronising' tags if the file is only to be processed by new software. Unsynchronisation may only be made with MPEG 2 layer I, II and III and MPEG 2.5 files.

Whenever a false synchronisation is found within the tag, one zeroed byte is inserted after the first false synchronisation byte. The format of a correct sync that should be altered by ID3 encoders is as follows:

%11111111 111xxxxx

And should be replaced with:

%11111111 00000000 111xxxxx

This has the side effect that all $FF 00 combinations have to be altered, so they won't be affected by the decoding process. Therefore all the $FF 00 combinations have to be replaced with the $FF 00 00 combination during the unsynchronisation.

To indicate usage of the unsynchronisation, the first bit in 'ID3 flags' should be set (note: I've found that bit). This bit should only be set if the tag contains a, now corrected, false synchronisation. The bit should only be clear if the tag does not contain any false synchronisations.

Do bear in mind, that if a compression scheme is used by the encoder, the unsynchronisation scheme should be applied afterwards. When decoding a compressed, 'unsynchronised' file, the 'unsynchronisation scheme' should be parsed first, decompression afterwards.
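As a minimal sketch of the flag check described in the quoted passage: an ID3v2 header is 10 bytes long ("ID3", two version bytes, one flags byte, four size bytes), and the unsynchronisation flag is the most significant bit of the flags byte. The function name `id3v2HasUnsyncFlag` is hypothetical and not part of the original code.

```php
<?php
// Hypothetical helper (name is an assumption): reports whether the
// unsynchronisation flag is set in a raw 10-byte ID3v2 header.
function id3v2HasUnsyncFlag(string $header): bool
{
    // Header layout: "ID3", major version, revision, flags, 4 size bytes.
    if (strlen($header) < 10 || substr($header, 0, 3) !== 'ID3') {
        return false; // not an ID3v2 header
    }

    $flags = ord($header[5]);      // flags byte sits at offset 5
    return ($flags & 0x80) !== 0;  // most significant bit = unsynchronisation
}
```

The first 10 bytes read from the file can be passed in directly, for example `id3v2HasUnsyncFlag(fread($f, 10))`.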

My questions are:

  1. How to search & replace this bit-pattern %11111111 111xxxxx with %11111111 00000000 111xxxxx?
  2. Vice versa, how to search & replace this bit-pattern %11111111 00000000 111xxxxx with %11111111 111xxxxx?

...using preg_replace().

The code I've written so far works perfectly, and I'm missing just one more line (well, two, to be exact).

<?php

  // some basic checkings here, such as 'does file exist'
  // and 'is it readable'

  $f = fopen('test.mp3', 'rb'); // 'b' = binary mode, so bytes are read unchanged

  // ...rest of my code...  

  $pattern1 = '?????'; // pattern from 1st question
  $id3stream = preg_replace($pattern1, 'something1', $id3stream);

  // ...extracting frames...

  $pattern2 = '?????'; // pattern from 2nd question
  $id3stream = preg_replace($pattern2, 'something2', $id3stream);

  // ..do more job...

  fclose($f);

?>

How can I make those two preg_replace() lines work?

P.S. I know how to do it by reading byte after byte in a loop, but I'm sure this is possible using regular expressions (to be honest, I'm not good at regex).

Let me know if you need more details.


One more thing...

At the moment I'm using this pattern

$pattern0 = '/[\x00].*/';
echo preg_replace($pattern0, '', $input_string);

to cut off the part of the string from the first zero byte to the end. Is that the correct way to do this?


Update

(@mario's answer).

In the first couple of tests this code returned the correct result.

  // print original stream
  printStreamHex($stream_original, 'ORIGINAL STREAM');

  // adding zero pads on unsync scheme
  $stream_1 = preg_replace(':([\\xFF])([\\xE0-\\xFF]):', "$1\x00$2", $stream_original);
  printStreamHex($stream_1, 'AFTER ADDING ZEROS');

  // reversing process
  $stream_2 = preg_replace(':([\\xFF])([\\x00])([\\xE0-\\xFF]):', "$1$3", $stream_1);
  printStreamHex($stream_2, 'AFTER REMOVING ZEROS');


  echo "Status: <b>" . ($stream_original == $stream_2 ? "OK" : "Failed") . "</b>";

But a few minutes later I found a specific case where everything looks like the expected result, yet there are still FFE0+ pairs in the stream.

ORIGINAL STREAM
+-----------------------------------------------------------------+
| FF  E0  DB  49  53  BE  3B  E0  90  40  EA  2B  3A  61  FF  FA  |
| 84  E0  A9  99  1F  39  B5  E1  54  FF  E7  ED  B8  B1  3A  36  |
| 88  01  69  CA  7D  47  FA  E1  70  7C  85  34  B8  1A  FF  FF  |
| FF  F8  21  F9  2F  FF  F7  17  67  EB  2A  EB  6E  41  82  FF  |
+-----------------------------------------------------------------+

AFTER ADDING ZEROS
+-----------------------------------------------------------------+
| FF  00  E0  DB  49  53  BE  3B  E0  90  40  EA  2B  3A  61  FF  |
| 00  FA  84  E0  A9  99  1F  39  B5  E1  54  FF  00  E7  ED  B8  |
| B1  3A  36  88  01  69  CA  7D  47  FA  E1  70  7C  85  34  B8  |
| 1A  FF  00  FF  FF  00  F8  21  F9  2F  FF  00  F7  17  67  EB  |
| 2A  EB  6E  41  82  FF                                          |
+-----------------------------------------------------------------+

AFTER REMOVING ZEROS
+-----------------------------------------------------------------+
| FF  E0  DB  49  53  BE  3B  E0  90  40  EA  2B  3A  61  FF  FA  |
| 84  E0  A9  99  1F  39  B5  E1  54  FF  E7  ED  B8  B1  3A  36  |
| 88  01  69  CA  7D  47  FA  E1  70  7C  85  34  B8  1A  FF  FF  |
| FF  F8  21  F9  2F  FF  F7  17  67  EB  2A  EB  6E  41  82  FF  |
+-----------------------------------------------------------------+

Status: OK

If the stream contains something like FF FF FF FF, it is replaced with FF 00 FF FF 00 FF, but it should be FF 00 FF 00 FF 00 FF. The remaining FF FF pair is again a false mp3 synchronisation, and my goal is to eliminate every FFE0+ pattern before the audio stream (that is, in the ID3v2 tag stream, because mp3 audio starts with an FFE0+ byte pair and its first occurrence should be at the beginning of the audio data). I figured out that I can run the same regex in a loop until the stream no longer contains any FFE0+ byte pair. Is there any solution that doesn't require a loop?
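One possible loop-free approach, offered as a sketch rather than a definitive implementation: match only the FF byte and use a lookahead for the byte that follows it. Because the lookahead does not consume the following byte, that byte can itself be matched as the start of the next pair, which is what the capturing pattern misses in runs such as FF FF FF FF. The character class below also includes 00, so that FF 00 becomes FF 00 00 as the quoted documentation requires. The function names `unsynchronise` and `resynchronise` are hypothetical, and the sketch assumes the whole tag is unsynchronised at once (ID3v2.2/2.3 style).

```php
<?php
// Hypothetical helper: insert a zero byte after every FF that is followed
// by 00 or by E0..FF. The lookahead (?=...) leaves the next byte unconsumed,
// so consecutive FF bytes are all handled in a single pass.
function unsynchronise(string $tag): string
{
    return preg_replace('/\xFF(?=[\x00\xE0-\xFF])/', "\xFF\x00", $tag);
}

// Hypothetical helper: reverse the scheme by dropping the zero byte that
// follows each FF. No pattern is needed, so str_replace() is enough.
function resynchronise(string $tag): string
{
    return str_replace("\xFF\x00", "\xFF", $tag);
}

$original = "\xFF\xFF\xFF\xFF";
$encoded  = unsynchronise($original);

echo bin2hex($encoded), "\n";  // prints ff00ff00ff00ff
echo resynchronise($encoded) === $original ? "OK" : "Failed", "\n";
```

The reverse step removes the zero after every FF, not only before E0..FF, which is what makes the FF 00 to FF 00 00 escaping round-trip correctly.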

Great job @mario, thanks a lot!

Wh1T3h4Ck5
  • Only ID3v2.2 and ID3v2.3 may be unsynchronised over the whole tag. ID3v2.4 defines that per frame, which means you cannot wildly de-unsynchronise as per matching occurance - see https://id3.org/id3v2.4.0-changes §3. – AmigoJack Aug 02 '20 at 22:43
  • @AmigoJack Yes, you're right. Back in time, when I was working with ID3v2 tags I had 2 million+ base of audio files and all of them had none, ID3v1 or ID3v2.3 tags so I actually never needed to get into v2.4 tag. But nice you mentioned that difference, might be helpful to other users. – Wh1T3h4Ck5 Aug 05 '20 at 15:04

1 Answer

Binary strings are not quite the turf of regular expressions. But you already had the right approach with using \x00.

3.. to cut off part of string starting at first zero-byte until the end

$pattern0 = '/[\\x00].*$/';

You were just missing the $ here.
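One caveat worth checking on binary data: by default `.` does not match a newline byte (0x0A), so without the `s` modifier the match stops at the first newline after the zero byte instead of running to the end of the string. A minimal sketch of two alternatives follows; the helper name `cutAtFirstNul` and the sample input are assumptions made for illustration.

```php
<?php
// Hypothetical helper: keep only the part of $s before the first NUL byte,
// using plain string functions instead of a regular expression.
function cutAtFirstNul(string $s): string
{
    $pos = strpos($s, "\x00");
    return $pos === false ? $s : substr($s, 0, $pos);
}

// Made-up sample: text, a NUL byte, then data that contains a newline.
$input = "title\x00junk\nmore junk";

// Without /s the dot stops at the newline, so the tail survives.
$withoutS = preg_replace('/\x00.*/', '', $input);

// With /s the dot also matches newlines, so everything after NUL is cut.
$withS = preg_replace('/\x00.*/s', '', $input);

echo cutAtFirstNul($input), "\n";  // prints title
```

Here `$withoutS` still contains the bytes after the newline, while `$withS` and `cutAtFirstNul($input)` both give the same truncated string.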

1.. How to search & replace this bit-pattern %11111111 111xxxxx with %11111111 00000000 111xxxxx?

Use the byte FF followed by the range E0 to FF for these bit-strings.

$id3stream = preg_replace(':([\\xFF])([\\xE0-\\xFF]):', "$1\x00$2", $id3stream);

Using the $2 here in the replacement string, since you search for a variable byte. Otherwise a simpler str_replace would work.

2.. Vice versa, how to search & replace this bit-pattern %11111111 00000000 111xxxxx with %11111111 111xxxxx?

Same trick.

$id3stream = preg_replace(':([\\xFF])([\\x00])([\\xE0-\\xFF]):', "$1$3", $id3stream);

I would only watch out to always use the \\ double backslash, so that it is PCRE which interprets the \x00 hex sequences, not the PHP parser. (It would end up becoming a C string terminator before it reaches libpcre.)
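To illustrate the quoting point, here is a small sketch of how PHP treats the escape in each kind of string literal. The variable names are made up for the example.

```php
<?php
// Single quotes: PHP leaves \x00 untouched, so PCRE receives the four
// characters backslash, x, 0, 0 and interprets the escape itself.
$forPcre = '\x00';

// Inside single quotes \\ collapses to one backslash, so this is the
// same four-character string as above.
$doubled = '\\x00';

// Double quotes: PHP converts the escape into one raw NUL byte before
// the string is passed anywhere.
$rawNul = "\x00";

echo strlen($forPcre), ' ', strlen($doubled), ' ', strlen($rawNul), "\n"; // prints 4 4 1
```

So in single-quoted patterns both `\x00` and `\\x00` reach PCRE as an escape sequence; only the double-quoted form puts a raw zero byte into the pattern.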

mario
  • +1 @mario, thank you... I'll try this later and let you know does it really help. About first regex, what that dollar-sign changes in my case? I mean, it also works without that `$`. – Wh1T3h4Ck5 Apr 19 '11 at 08:05
  • The `$` dollar sign looks for the end of the string. Now that I think of it, you probably wouldn't need it. You're just looking for the first NUL and .* is greedy per default. – mario Apr 19 '11 at 08:07
  • @mario, if I understood this in right way... that part `\\xE0-\\xFF` means range between E0 and FF (hex); in fact that's equal to this bitmask %111xxxxx, right? – Wh1T3h4Ck5 Apr 19 '11 at 08:09
  • Right, that's just the hexadecimal version of 111xxxxx. I've cheated with `print base_convert("11100000", 2, 16);` telling me that it is `E0`. – mario Apr 19 '11 at 08:11
  • Thanx, I'm at workplace right now, so I'll try it when return back to home. Seems correct to me, even I don't know so much about regex (just couple of basic things). But this will save a lot of my time (trying to loop through all bytes in stream, evaluating them, comparing, inserting zero-pads, removing it later, etc). – Wh1T3h4Ck5 Apr 19 '11 at 08:19
  • Seems a pretty interesting use case. Would like to know if that works. (Against false positives in the rest of that binary data, the fourth `,1` parameter for preg_replace could help.) – mario Apr 19 '11 at 08:22
  • Sorry, I don't understand that part about `,1`. – Wh1T3h4Ck5 Apr 19 '11 at 08:25
  • See the manual page on [`preg_replace($pattern, $repl, $subject, 1)`](http://www.php.net/manual/en/function.preg-replace.php) - it ensures that only one found match is replaced - instead of all. Might be useful here. – mario Apr 19 '11 at 08:46
  • Wow that's a cool printout. -- There might be another option for converting consecutive FF FF occurences, but your repeated run of preg_replace is probably more reliable. -- Anyway, fascinating use case. Thanks for sharing! – mario Apr 19 '11 at 23:45