
I am trying to extract a JPEG image from a binary file. I want to extract all data between 0xFF 0xD8 (start of image) and 0xFF 0xD9 (end of image), inclusive. Earlier, I successfully ran the following command to get the desired image.jpg from a single-paragraph file, received.txt:

sed 's/.*\xFF\xD8/\xFF\xD8/; s/\xFF\xD9.*/\xFF\xD9/' received.txt > image.jpg

But when I tried to run the same operation on a different file, it didn't work. I also tried using

sed -n '/\xFF\xD8/,/\xFF\xD9/p' received.txt > temp.txt
sed 's/.*\xFF\xD8/\xFF\xD8/; s/\xFF\xD9.*/\xFF\xD9/' temp.txt > image.jpg

to remove any lines before or after the matched lines, but had no success.

The file is too large to post in full, so I have pasted the hex dump of the relevant portion below:

0a 55 57 5d 50 cf ff d8 ff fe ff ff ff d9 df 47 fe e7 c9 3b e9 9b 6b 55 c4 57 9b 98 73 fd 15 f7 77 7e f7 95 dd 55 f7 55 05 cc 55 97 55 dd 62 d1 1f 51 ef f1 ef fb e9 bf ed 5f bf f2 9d 75 af fe 6b fb bf 8f f7 f7 7e ff d3 bf 8e d5 5f df 57 75 fe 77 7b bf d7 af df 5d fb 0a 47 de d5 ff c1 23 9b 20 08 20 65 3c 06 83 11 05 30 50 a0 20 55 20 84 41 04 c2 59 50 89 64 44 44 10 05 20 87 28 1d a9

The hex dump of the desired output in this case is:

ff d8 ff fe ff ff ff d9

Update

While trying to resolve the issue, I found that the sed command removes the characters before or after a matched pattern only up to the nearest non-ASCII byte (0x80 - 0xFF); it does not go beyond that byte. For example, if we try:

echo 55 57 5d 50 cf 50 65 7f ff d8 ff fe ff ff ff d9 | xxd -r -p | sed 's/.*\xFF\xD8/\xFF\xD8/' > output

The hex dump of the output can be seen as:

xxd output

which is:

55 57 5d 50 cf ff d8 ff fe ff ff ff d9

As can be seen, the characters between the non-ASCII character and matched pattern are removed but the characters before the non-ASCII character are not.
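
One plausible explanation, offered here as an assumption rather than a confirmed diagnosis: in a UTF-8 locale, GNU sed's `.` does not match a byte that is not valid UTF-8, so `.*` stops at 0xCF. Forcing the C locale makes `.` match any byte except a newline. A minimal sketch, assuming GNU sed (for the `\xHH` escapes); the function name `strip_before_soi` is made up for illustration:

```shell
#!/bin/sh
# Hypothetical helper: drop everything on a line up to the last 0xFF 0xD8.
# LC_ALL=C makes sed treat its input as single bytes, so '.' also matches
# bytes in the 0x80 - 0xFF range. Assumes GNU sed for the \xHH escapes.
strip_before_soi() {
    LC_ALL=C sed 's/.*\xFF\xD8/\xFF\xD8/'
}

# Same bytes as the example above, written as octal escapes for printf.
printf '\125\127\135\120\317\120\145\177\377\330\377\376\377\377\377\331' |
    strip_before_soi |
    od -An -v -tx1
```

This is still line-oriented and `.*` is still greedy, so it does not help when a 0x0A byte or a second 0xFF 0xD8 occurs in the data.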


Alternative Solution (not perfect)

I used the following commands to somewhat resolve the problem:

sed 's/\xFF\xD8/\x0A\xFF\xD8/; s/\xFF\xD9/\xFF\xD9\x0A/' received.txt > temp.txt

Then I run the following command, which works only if there is no newline byte (0x0A) anywhere between 0xFF 0xD8 and 0xFF 0xD9:

sed -n '/\xFF\xD8/{/\xFF\xD9/p}' temp.txt > image.jpg

If image.jpg is empty after the above command, I run the following command instead:

sed -n '/\xFF\xD8/,/\xFF\xD9/p' temp.txt > image.jpg

These commands do the desired job, except that they leave a 0x0A byte at the end of image.jpg (i.e., after 0xFF 0xD9). In my case this did not cause any issue, since data after the 0xFF 0xD9 marker is discarded when the JPEG is read.

I was stuck at the implementation of 'if image file is empty' condition when @chaos came up with a perfect solution. So, I am now following his solution. Thanks a lot @chaos!

Please follow this link for chaos's solution: https://unix.stackexchange.com/questions/231289/extract-data-between-two-matched-patterns-in-a-binary-file
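
Independently of the linked answer, another way to avoid both the newline restriction and the trailing 0x0A is to locate the markers by byte offset and cut the file with tail and head. The following is only a sketch under assumptions: the function name `extract_first_jpeg` is made up, it takes the first 0xFF 0xD8 and the first 0xFF 0xD9 after it (which may be the wrong pair for a JPEG with an embedded thumbnail), and it relies on `head -c`, which is widely available but not required by POSIX.

```shell
#!/bin/sh
# Hypothetical helper: print the bytes from the first 0xFF 0xD8 through the
# first 0xFF 0xD9 that follows it, inclusive. Usage: extract_first_jpeg FILE
extract_first_jpeg() {
    in=$1
    # od dumps every byte as two hex digits; tr puts one byte per line so
    # awk can count byte positions (1-based) regardless of 0x0A in the data.
    range=$(od -An -v -tx1 "$in" | tr -s ' ' '\n' | awk '
        $0 == "" { next }                      # skip blank lines left by tr
        { n++ }                                # n = position of current byte
        start == 0 && prev == "ff" && $0 == "d8" { start = n - 1 }
        start > 0  && prev == "ff" && $0 == "d9" { print start, n - start + 1; exit }
        { prev = $0 }
    ')
    [ -n "$range" ] || return 1                # no complete SOI..EOI pair found
    start=${range% *}                          # position of the 0xFF of the SOI
    len=${range#* }                            # number of bytes through the EOI
    tail -c +"$start" "$in" | head -c "$len"
}

# Demo with the first 16 bytes of the hex dump from the question.
tmp=$(mktemp)
printf '\012\125\127\135\120\317\377\330\377\376\377\377\377\331\337\107' > "$tmp"
extract_first_jpeg "$tmp" | od -An -v -tx1
rm -f "$tmp"
```

For the real file this would be called as `extract_first_jpeg received.txt > image.jpg`. The demo should dump the eight bytes ff d8 ff fe ff ff ff d9.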


Notes:

Here is how you can recover the actual data from its hex dump, which you can then pipe to the sed command:

echo 0a 55 57 5d 50 cf ff d8 ff fe ff ff ff d9 df 47 fe e7 c9 3b e9 9b 6b 55 c4 57 9b 98 73 fd 15 f7 77 7e f7 95 dd 55 f7 55 05 cc 55 97 55 dd 62 d1 1f 51 ef f1 ef fb e9 bf ed 5f bf f2 9d 75 af fe 6b fb bf 8f f7 f7 7e ff d3 bf 8e d5 5f df 57 75 fe 77 7b bf d7 af df 5d fb 0a 47 de d5 ff c1 23 9b 20 08 20 65 3c 06 83 11 05 30 50 a0 20 55 20 84 41 04 c2 59 50 89 64 44 44 10 05 20 87 28 1d a9 | xxd -r -p

and you can see the hex dump of a file by:

xxd file.txt
  • `.*\xFF` will match up to the last `\xFF`, not the first. – 123 Sep 21 '15 at 09:34
  • How can you select the right SOI marker (`\xFF\xD8`) and EOI marker (`\xFF\xD9`) without information about this container? sed only matches this byte sequence (at best), without knowing any structure. – NeronLeVelu Sep 21 '15 at 10:43
  • I am extremely sorry, I didn't get your point at all. Can you please elaborate. – Adnan Ashraf Sep 21 '15 at 10:52
