
I have a binary file and want to extract part of it, starting from a known byte string (e.g. FF D8 FF D0) and ending with another known byte string (AF FF D9).

In the past I've used dd to cut parts of a binary file from the beginning or end, but that command doesn't seem to support what I'm asking for.

What command-line tool can do this?

Laurent Grégoire
theta

7 Answers

7

Locate the start/end position, then extract the range.

$ xxd -g0 input.bin | grep -im1 FFD8FFD0  | awk -F: '{print $1}'
0000cb0
$ ^FFD8FFD0^AFFFD9^    # bash history substitution: re-runs the grep with the end pattern
0009590
$ dd ibs=1 count=$((0x9590-0xcb0+1)) skip=$((0xcb0)) if=input.bin of=output.bin
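As an aside, if your grep is GNU grep with PCRE support, the offsets can be found directly in the binary, without going through a hex dump at all. This is a sketch, not part of the original answer; input.bin/output.bin are placeholder names and the printf line just fabricates a 12-byte demo file containing both patterns:

```shell
# bytes: 00 01 ff d8 ff d0 41 42 af ff d9 07 (octal escapes for portability)
printf '\000\001\377\330\377\320\101\102\257\377\331\007' > input.bin
# -P enables \xNN escapes, -b prints byte offsets, -o one match per line,
# -a/-U force grep to treat the binary file as text
start=$(LC_ALL=C grep -obUaP '\xff\xd8\xff\xd0' input.bin | head -1 | cut -d: -f1)
end=$(LC_ALL=C grep -obUaP '\xaf\xff\xd9' input.bin | head -1 | cut -d: -f1)
# +3 keeps the three-byte end pattern in the output
dd if=input.bin of=output.bin bs=1 skip=$start count=$((end + 3 - start)) 2>/dev/null
od -An -tx1 output.bin | tr -d ' \n'   # ffd8ffd04142afffd9
```

This sidesteps both the line-split and the nibble-alignment issues discussed in the comments, since the search runs on raw bytes.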
kev
  • I found "..count=$((0x9590-0xcb0+2)) skip=$((0xcb0+1))..." to match exactly starting from "FFD8.." and ending to "AFFF..". Thank you for your nice procedure. Cheers – theta Feb 26 '12 at 10:19
  • After a couple of extractions I noticed that this is only an approximate solution. +1, +2 all depend on content. For example `007d820: 74290068656c6c6f2e6a706700ffd8ff` gives 007d820 for both '74 29 00 68' and '00 ff d8 ff', so something slightly different has to be done – theta Feb 26 '12 at 12:19
  • This *does not work*. If the pattern to match is split across two lines of `xxd` output it will never be found (by default `xxd -g0` groups 16 bytes per line). For a pattern 4 bytes long, the probability of a split is 25%. Also, the `grep|awk` prints the address of the *beginning* of the line where the pattern occurs, so a delta of up to the line size can happen, and you end up with more data than you really want. – Laurent Grégoire Feb 27 '12 at 07:42
  • @lOranger use `-c 160` option to reduce the probability. – kev Feb 27 '12 at 07:58
  • We're not talking about *probability* here, but *certainty*! Even with 160 (the max is 256 for xxd), the probability is more than 2%, which is **huge**. If you automate this, you need a script that *works all the time*, not 98% of the time. See my answer below for a proposal that works all the time. – Laurent Grégoire Feb 27 '12 at 08:30
3

In a single pipe:

xxd -c1 -p file |
  awk -v b="ffd8ffd0" -v e="afffd9" '
    found == 1 {
      print $0
      str = str $0
      if (str == e) {found = 0; exit}
      if (length(str) == length(e)) str = substr(str, 3)}
    found == 0 {
      str = str $0
      if (str == b) {found = 1; print str; str = ""}
      if (length(str) == length(b)) str = substr(str, 3)}
    END{ exit found }' |
  xxd -r -p > new_file
test ${PIPESTATUS[1]} -eq 0 || rm new_file

The idea is to use awk between two xxd invocations to select the part of the file that is needed. Once the 1st pattern is found, awk prints the bytes until the 2nd pattern is found, then exits.

The case where the 1st pattern is found but the 2nd is not must be taken into account. It is handled in the END part of the awk script, which returns a non-zero exit status. This is caught by bash's ${PIPESTATUS[1]}, where I decided to delete the new file.

Note that an empty file also means that nothing was found.
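As a quick sanity check, the whole pipeline can be exercised on a tiny synthetic file (the byte values below are invented for the demo; the end pattern is the question's af ff d9):

```shell
# bytes: 00 01 ff d8 ff d0 41 42 af ff d9 07 (octal escapes for portability)
printf '\000\001\377\330\377\320\101\102\257\377\331\007' > file
xxd -c1 -p file |
  awk -v b="ffd8ffd0" -v e="afffd9" '
    found == 1 {
      print $0
      str = str $0
      if (str == e) {found = 0; exit}
      if (length(str) == length(e)) str = substr(str, 3)}
    found == 0 {
      str = str $0
      if (str == b) {found = 1; print str; str = ""}
      if (length(str) == length(b)) str = substr(str, 3)}
    END{ exit found }' |
  xxd -r -p > new_file
od -An -tx1 new_file | tr -d ' \n'   # ffd8ffd04142afffd9
```

The extracted range includes both patterns, which matches what the question asks for.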

jfg956
  • Yet another mark reassignment - lOranger's solution fails if the 2nd pattern can be found before the 1st, giving $len a negative sign. This solution searches after the 1st pattern match, so it doesn't have such a problem, nor does it generate an intermediate triple-size file. – theta Feb 28 '12 at 09:46
  • After testing this more, I found it without issues, but it's rather slow on larger files. Does anyone see a place for some optimisation, or is this the best one can get from xxd/awk? – theta Feb 28 '12 at 12:34
  • Try the new `sed` version that I just posted. This one can be optimized by replacing string concatenation and extraction with rotatory indexes in arrays, but it is less readable; and I do not want to do it if not needed ;-). – jfg956 Feb 28 '12 at 13:03
2

This should work with standard tools (xxd, tr, grep, awk, dd). It correctly handles the "pattern split across lines" issue, and looks for the pattern only at byte-aligned offsets (not nibble-aligned).

file=<yourfile>
outfile=<youroutputfile>
startpattern="ff d8 ff d0"
endpattern="af ff d9"
xxd -g0 -c1 -ps ${file} | tr '\n' ' ' > ${file}.hex
start=$(($(grep -bo "${startpattern}" ${file}.hex \
    | head -1 | awk -F: '{print $1}')/3))
len=$(($(grep -bo "${endpattern}" ${file}.hex \
    | head -1 | awk -F: '{print $1}')/3+3-${start}))
dd ibs=1 count=${len} skip=${start} if=${file} of=${outfile}

Each byte occupies three characters ("xx ") in the .hex file, so the byte offset reported by grep divided by 3 is the byte index; the +3 makes the output include the three-byte end pattern.

Note: The script above uses a temporary file to avoid doing the binary-to-hex conversion twice. A space/time trade-off is to pipe the result of xxd directly into the two greps. A one-liner is also possible, at the expense of clarity.

One could also use tee and named pipes to avoid storing a temporary file and converting the output twice, but I'm not sure it would be faster (xxd is fast) and it is certainly more complex to write.
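For illustration, here is one way the single-conversion idea could look without a temporary file, holding the hex dump in a shell variable instead. This is a hedged sketch, not the original author's code: the helper off() and the demo file built by printf are made up for the example.

```shell
# bytes: 00 01 ff d8 ff d0 41 42 af ff d9 07 (octal escapes for portability)
printf '\000\001\377\330\377\320\101\102\257\377\331\007' > input.bin
# convert once, keep the "xx xx ..." hex text in memory
hex=$(xxd -c1 -ps input.bin | tr '\n' ' ')
# off PATTERN -> character offset of the first match in $hex
off() { printf '%s' "$hex" | grep -bo "$1" | head -1 | cut -d: -f1; }
start=$(( $(off "ff d8 ff d0") / 3 ))          # each byte is 3 chars: "xx "
len=$(( $(off "af ff d9") / 3 + 3 - start ))   # +3 keeps the end pattern
dd ibs=1 skip=$start count=$len if=input.bin of=output.bin 2>/dev/null
od -An -tx1 output.bin | tr -d ' \n'   # ffd8ffd04142afffd9
```

The spaces between bytes still guarantee byte alignment, for the same reason as in the temporary-file version.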

Laurent Grégoire
  • lOranger, I used -c64 to compensate a bit, and `cut` and `sed` to calculate the correct address, but -c1 should be the real solution. I'll mark your solution once I manage to make it work. First I needed to swap the places of `grep`'s pattern and filename to make grep work, but regardless I get `dd: invalid number`; I imagine the problem is in the start/len calculation. Also, can't we exclude the empty space and save 1/3 of the output .hex file, which would be double the input file size instead of triple as it is now? – theta Feb 27 '12 at 10:00
  • Sorry, there was a typo in the script: `grep` pattern should be *before* the filename. I also added a `| head -1` to cover the case where the pattern appears multiple times in the input, which can happen. Concerning your question, the space between hex bytes is necessary, otherwise you have the "nibble" issue (pattern is not aligned on byte boundaries). – Laurent Grégoire Feb 27 '12 at 10:25
  • I'm afraid it still doesn't work. I get input file as result. I used my -c64 script, and get expected dump, but I was unwilling to post it here as it was fragile on boundaries (better than provided, but still..) – theta Feb 27 '12 at 11:18
  • Please note that you have to convert your hex pattern to *lowercase* (or add option `-i` in `grep`). I've just tested the script here with a big binary file and it works fine. Please print the values of ${start} and ${len} to debug (you can check that start and len > 0 to prevent cases where the pattern is not found in the input). – Laurent Grégoire Feb 27 '12 at 12:26
  • Just in case: http://pastebin.com/raw.php?i=hZ5UqAF9 Patterns are in lower case. It simply returns the input file as dump, so start and end position are 0 and input file length. – theta Feb 27 '12 at 12:48
  • Well, I tested your script here and it works fine under a `bash` and `sh` script (provided I change the pattern to match some data in my input file). You have to check obviously that both patterns appears in the input. Which version of various tools are you using? Also please print `${start}` and `${len}` to check what's wrong. Please edit the .hex leftover file and manually check that the patterns are present, just in case... – Laurent Grégoire Feb 27 '12 at 12:56
  • Try it yourself with script from pastebin on this file: http://ge.tt/1EjaXGE/v/0 (160K) – theta Feb 27 '12 at 13:12
  • let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/8261/discussion-between-loranger-and-theta) – Laurent Grégoire Feb 27 '12 at 13:24
1

See this link for a way to do a binary grep. Once you have the start and end offsets, you should be able to use dd to extract what you need.
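For completeness, once the binary grep has produced the two offsets, the dd step could look like this. A sketch only: the start/end values below are stand-ins for whatever the search reports, and the printf line fabricates a matching demo file.

```shell
# bytes: 00 01 ff d8 ff d0 41 42 af ff d9 07 (octal escapes for portability)
printf '\000\001\377\330\377\320\101\102\257\377\331\007' > input.bin
start=2   # byte offset of FF D8 FF D0, as reported by the binary grep
end=8     # byte offset of AF FF D9
# count spans from the start pattern through the 3-byte end pattern
dd if=input.bin of=output.bin bs=1 skip=$start count=$((end + 3 - start)) 2>/dev/null
od -An -tx1 output.bin | tr -d ' \n'   # ffd8ffd04142afffd9
```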

Laurent Grégoire
1

A variation on the awk solution, assuming that your binary file, once converted to hex with spaces, fits in memory:

xxd -c1 -p file |
  tr "\n" " " |
  sed -n -e 's/.*\(ff d8 ff d0.*af ff d9\).*/\1/p' |
  xxd -r -p > new_file
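A quick check of the idea on a 12-byte synthetic file (byte values invented for the demo; the end pattern is written as af ff d9 to match the question):

```shell
# bytes: 00 01 ff d8 ff d0 41 42 af ff d9 07 (octal escapes for portability)
printf '\000\001\377\330\377\320\101\102\257\377\331\007' > file
xxd -c1 -p file |
  tr "\n" " " |
  sed -n -e 's/.*\(ff d8 ff d0.*af ff d9\).*/\1/p' |
  xxd -r -p > new_file
od -An -tx1 new_file | tr -d ' \n'   # ffd8ffd04142afffd9
```

Note that both leading and inner .* are greedy, so with multiple occurrences of the patterns this keeps the widest possible range.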
jfg956
  • WOW, this is so sweet and looks so easy. Couldn't be better than this. I'll leave the mark on lOranger's answer as it is correct and was answered earlier, but this is by far my favourite snippet – theta Feb 27 '12 at 22:53
  • Too bad the quickest gets the mark, not the shortest... Anyway, it can still be optimized by removing the `tr`, replacing it inside `sed` by `-e '1h' -e '2,$H' -e '${x;s/\n/ /g}'` and performing the above substitution only on the last line. Note that this solution does not work on huge binary files, as the file needs to be put in memory by `sed`. On huge files, use the `awk` solution. – jfg956 Feb 28 '12 at 07:27
  • Thanks. I tested this on 1GB laptop, and it was fine for 5MB file, but it made my system inaccessible on 50MB file. Is there maybe some general rule for determining "limit" file size based on available RAM, in your opinion? – theta Feb 28 '12 at 09:49
  • A 50MB file means 150MB once decoded and once bytes are separated by spaces. It is not that much, but could cause `sed` to behave very slowly: a line of 150MB is a lot! You could try the `-n` option of `sed` to remove buffering, but it could just worsen the problem. It is difficult to give an opinion on the limit: I do not know about the `sed` implementation. The best is to do many tries. Sorry not to be able to help more. – jfg956 Feb 28 '12 at 12:26
  • Thanks. You helped more than enough – theta Feb 28 '12 at 12:32
  • The three sets of wildcards make `sed` do a lot of recursive searching, probably... I think that may be the reason that things slow down when the file gets big. – Floris Jul 07 '17 at 22:57
1

Another solution in sed, but using less memory:

xxd -c1 -p file |
  sed -n -e '1{N;N;N}' -e '/ff\nd8\nff\nd0/{:begin;p;s/.*//;n;bbegin}' -e 'N;D' | 
  sed -n -e '1{N;N}' -e '/af\nff\nd9/{p;Q1}' -e 'P;N;D' |
  xxd -r -p > new_file
test ${PIPESTATUS[2]} -eq 1 || rm new_file

The 1st sed prints from ff d8 ff d0 to the end of the file. Note that you need as many N commands in -e '1{N;N;N}' as there are bytes in your 1st pattern, less one.

The 2nd sed prints from the beginning of the file to af ff d9. Note again that you need as many N commands in -e '1{N;N}' as there are bytes in your 2nd pattern, less one.

Again, a test is needed to check that the 2nd pattern was found, and to delete the file if it was not.

Note that the Q command is a GNU extension to sed. If you do not have it, you need to discard the rest of the file once the pattern is found (in a loop like the 1st sed, but without printing), and check after the hex-to-binary conversion that new_file ends with the right pattern.
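The two-sed pipeline can be sanity-checked the same way (GNU sed assumed, for Q; the byte values are invented for the demo, with the question's end pattern af ff d9):

```shell
# bytes: 00 01 ff d8 ff d0 41 42 af ff d9 07 (octal escapes for portability)
printf '\000\001\377\330\377\320\101\102\257\377\331\007' > file
xxd -c1 -p file |
  sed -n -e '1{N;N;N}' -e '/ff\nd8\nff\nd0/{:begin;p;s/.*//;n;bbegin}' -e 'N;D' |
  sed -n -e '1{N;N}' -e '/af\nff\nd9/{p;Q1}' -e 'P;N;D' |
  xxd -r -p > new_file
od -An -tx1 new_file | tr -d ' \n'   # ffd8ffd04142afffd9
```

The first sed starts printing at the 4-byte start pattern and the second stops (with exit status 1, for the PIPESTATUS check) right after the 3-byte end pattern.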

jfg956
  • I do have this GNU extension to sed, but can't make this script work for some reason – theta Feb 28 '12 at 13:27
  • Sorry, typo in the 2nd `sed`: it should work if you replace `/aa\nff\nd9/` with `/af\nff\nd9/`. – jfg956 Feb 28 '12 at 15:18
  • I don't understand what difference that would make? Please try this sample: http://ge.tt/42cScKE/v/0?c (160K) – theta Feb 28 '12 at 15:44
  • The link is not working :-(. If you do not have any output, it means that those 2 patterns are not found. You can debug the script running the 2 first commands and adding other after. About the change, I think you are looking for data between `ff d8 ff d0` and `af ff d9`, but the script in my solution above is taking data between `ff d8 ff d0` and `aa ff d9`. – jfg956 Feb 28 '12 at 19:13
  • Sorry, link must have expired. I uploaded on other service, please try here: http://hotfile.com/dl/148193223/e90ab68/bin.dat.html Patterns are of course present in file, I checked multiple times – theta Feb 29 '12 at 01:51
  • Ok, there was an error in the final test. I corrected it. The error was also in the awk version that I also corrected. – jfg956 Feb 29 '12 at 06:51
0

You can use binwalk to do this. The tool autodetects the files (and their offsets) embedded in the input binary.

With the -e flag, it extracts everything it finds into a directory created where you run the command.

It comes preinstalled on some distributions, and you can easily install the CLI tool with sudo apt install binwalk.

As an example, I hid a zip file (whose content is a text file called pass.txt) inside a .jpg image and recovered it this way.

Read the manual for further information.

A.Casanova