
Being very new to shell scripts, I have pieced together the following to search /dev/sdd1, sector by sector, to find a string. How do I get the sector data into the $HAYSTACK variable?

#!/bin/bash

HAYSTACK=""
START_SEARCH=$1
NEEDLE=$2
START_SECTOR=2048
END_SECTOR=226512895+1
SECTOR_NUMBER=$((START_SEARCH + START_SECTOR))
while [  $SECTOR_NUMBER -lt $END_SECTOR ]; do
    $HAYSTACK=`dd if=/dev/sdd1 skip=$SECTOR_NUMBER count=1 bs=512`
    if [[ "$HAYSTACK" =~ "$NEEDLE" ]]; then
        echo "Match found at sector $SECTOR_NUMBER"
        break
    fi
    let SECTOR_NUMBER=SECTOR_NUMBER+1 
done

Update

The intention is not to make a perfect script to handle fragmented file scenarios (I doubt that is possible at all).

In my case, not being able to distinguish strings with nulls is also a non-issue.

If you could expand the pipe suggestions into an answer it would be more than enough. Thanks!

Background

I have managed to wipe my www folder and have been trying to recover as much of my source files as possible. I have used Scalpel to recover my php and html files. But the version I could get working on my Ubuntu 16.04 is Version 1.60 which does not support regex in header/footer so I cannot make a good pattern for css, js, and json files.

I remember fairly rare strings to search for and find my files, but have no idea where in a block the string could be. The solution I came up with is this shell script to read blocks from the partition and look for the substring and if a match is found print out the LSB number and exit.

jww
  • Starting a separate copy of `dd` per sector is going to add up to a lot of performance overhead. Is there a reason you want to do it that way? – Charles Duffy Oct 27 '17 at 11:38
  • Personally, I would tend to use a non-shell language here -- bash supports only C strings, and doesn't have a native type that's able to represent literal NUL values (without some hackery, such as (ab)using arrays for the purpose). The `[[ $value =~ $re ]]` approach will never be able to distinguish between `needle` and `needle` with NUL bytes embedded between its letters. – Charles Duffy Oct 27 '17 at 11:39
  • @CharlesDuffy, I am clueless about how to reuse a copy of dd. Is it doable? – Saiid Fouladpour Oct 27 '17 at 11:42
  • If the $NEEDLE crosses the block boundary, this approach will not find it. – jurez Oct 27 '17 at 11:43
  • If you just read 512 or 2048 bytes at a time, then you can read as many sectors as you want from a single stream created by just one copy of `dd`. The bigger problem, again, is dealing with the lack of language/library support for strings containing NUL literals -- though I suppose for your purpose you could just `dd | tr '\0' '*'` or somesuch, and have your NULs show up as a different, less problematic character. – Charles Duffy Oct 27 '17 at 11:44
  • ...so, if your `dd | tr` pipeline is writing to FD 3, `IFS= read -r -d '' -N 2048 sector <&3` will read 2048 bytes from it into the shell variable `sector`. – Charles Duffy Oct 27 '17 at 11:47
  • For a quick and dirty approach, you could use something like `dd | strings | grep "$NEEDLE" && echo "Found it"` – jurez Oct 27 '17 at 11:48
  • @jurez, once the script works, I will increase the size of the block to be sure I will get the needle in the block. The size of files is not more than a couple kbs. – Saiid Fouladpour Oct 27 '17 at 11:48
  • @SaiidFouladpour Even if you increase the size of the block, this will still not guarantee you will find it. The blocks might not be sequential, for example if file was fragmented. – jurez Oct 27 '17 at 11:51
  • BTW, you need to change `$HAYSTACK=...` to `HAYSTACK=...`; this is a class of bug http://shellcheck.net/ will identify automatically. – Charles Duffy Oct 27 '17 at 12:08
  • (...and as an aside -- all-caps variable names are used for variables with meaning to the OS or shell; using lower-case names for your own variables is guaranteed to avoid overwriting something with meaning to POSIX-defined tools by mistake. See http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap08.html, fourth paragraph). – Charles Duffy Oct 27 '17 at 12:14
  • @CharlesDuffy could you elaborate the method with `tr` pipe please? – Saiid Fouladpour Oct 27 '17 at 12:14
  • Not a great time to write up a full answer right now -- my attention is on another window -- but to summarize in a *bit* more detail: `exec 3< <(dd if=/dev/sdd1 | tr '\0' '@'); sector_count=0; while IFS= read -r -N 2048 -d '' sector <&3; do (( ${#sector} == 2048 )) || { echo "Got a short read (${#sector} bytes) at sector $sector_count; aborting!"; exit 1; }; [[ $sector =~ "$needle" ]] && echo "Found needle at $sector_count"; (( ++sector_count )); done` – Charles Duffy Oct 27 '17 at 12:22
  • ...of course, you'll want to use a filler character that isn't part of your needle. Amend appropriately. – Charles Duffy Oct 27 '17 at 12:23
  • Oh -- should have a `|| [[ $sector ]]` on the `read` above. So `while IFS= read -r -N 2048 -d '' sector <&3 || [[ $sector ]]` – Charles Duffy Oct 27 '17 at 13:40
  • @CharlesDuffy there are no line breaks in comments and that makes code hard to read. Would it not be better to post the code as an answer and later improve upon it if you feel like it? – Saiid Fouladpour Oct 27 '17 at 13:48
  • I don't approve of using bash for this purpose (even Python would be a better fit; Go or Julia even moreso, having competitive terseness but far better performance), and so don't intend to have my name on an answer. – Charles Duffy Oct 27 '17 at 13:56
  • @SaiidFouladpour, why not use a data recovery software for this? – Tarun Lalwani Oct 29 '17 at 17:06
  • Did you try [extundelete](https://unix.stackexchange.com/questions/122305/undelete-a-just-deleted-file-on-ext4-with-extundelete) on a cloned image or [ext3grep](http://manpages.ubuntu.com/manpages/zesty/man8/ext3grep.8.html)? – yacc Oct 30 '17 at 05:11
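
For readability, the one-liner from the comments above can be laid out as a function. This is only a sketch of that suggestion, not a tested recovery tool: the name `scan_blocks`, its parameters, and the choice to keep a short final block instead of aborting are assumptions made here. It expects a single-byte locale (for example `LC_ALL=C`) so that `read -N` counts bytes, and a filler character that does not occur in the needle.

```shell
#!/bin/bash
# Sketch only: scan_blocks and its parameters are hypothetical names.
# $1: device or file to read      $2: needle to look for
# $3: block size in bytes         $4: character that replaces NUL bytes
scan_blocks() {
    local source=$1 needle=$2 block_size=${3:-2048} filler=${4:-@}
    local block block_index=0

    # One dd process feeds the whole scan. tr swaps NUL bytes, which a bash
    # variable cannot hold, for the filler character.
    exec 3< <(dd if="$source" bs="$block_size" 2>/dev/null | tr '\0' "$filler")

    # read -N returns non-zero on a short final block, so the extra
    # [[ $block ]] test keeps that last piece of data in the loop.
    while IFS= read -r -N "$block_size" block <&3 || [[ $block ]]; do
        # Quoting the needle makes the match literal rather than a pattern.
        [[ $block == *"$needle"* ]] && echo "Found needle in block $block_index"
        (( ++block_index ))
    done

    exec 3<&-   # close the descriptor once the stream is exhausted
}
```

A call such as `scan_blocks /dev/sdd1 'some rare string' 2048` would report 0-based block indexes; multiplying an index by `block_size / 512` gives the 512-byte sector at which that block starts. A needle that straddles two blocks is still missed, as noted in the comments.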

2 Answers

  1. If the searched for item is a text string, consider using the -t option of the strings command to print the offset of where the string is found. Since strings doesn't care where the data is from, it works on files, block devices, and piped input from dd.

    Example from the start of a hard disk:

    sudo strings -t d /dev/sda | head -5
    

    Output:

        165 ZRr=
        286 `|f 
        295 \|f1
        392 GRUB 
        398 Geom
    

    Instead of head that could be piped to grep -m 1 GRUB, which would output only the first line with "GRUB":

    sudo strings -t d /dev/sda | grep -m 1 GRUB
    

    Output:

        392 GRUB 
    

    From there, bash can do quite a lot. This code finds the first 5 instances of "GRUB" on my boot partition /dev/sda7:

    s=GRUB ; sudo strings -t d /dev/sda7 | grep "$s" | 
    while read a b ; do
        n=${b%%${s}*}
        printf "String %-10.10s found %3i bytes into sector %i\n" \
             "\"${b#${n}}\"" $(( (a % 512) + ${#n} )) $((a/512 + 1)) 
    done | head -5
    

    Output (the sector numbers here are relative to the start of the partition):

    String "GRUB Boot found   7 bytes into sector 17074
    String "GRUB."    found 548 bytes into sector 25702
    String "GRUB."    found 317 bytes into sector 25873
    String "GRUBLAYO" found 269 bytes into sector 25972
    String "GRUB"     found 392 bytes into sector 26457
    

    Things to watch out for:

    • Don't do dd-based single-block searches with strings, as they would fail if the string spanned two blocks. Use strings to get the offset first, then convert that offset to blocks (or sectors).

    • strings -t d can return big strings, and the "needle" might be several bytes into a string, in which case the offset would be the start of the big string, rather than the grep string (or "needle"). The above bash code allows for that and uses the $n to calculate a corrected offset.
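
The offset correction described in the last point can be worked through on its own. The sketch below is an illustration under assumptions, not part of the answer's script: `locate_needle` is a hypothetical helper, the sector size defaults to 512, and the needle is assumed to occur in the string passed in.

```shell
#!/bin/bash
# locate_needle is a hypothetical helper name.
# $1: decimal offset printed by `strings -t d`
# $2: the string printed next to that offset
# $3: the needle, which must occur somewhere in $2
# $4: sector size in bytes (defaults to 512)
locate_needle() {
    local offset=$1 string=$2 needle=$3 sector_size=${4:-512}
    # Everything in the string before the first occurrence of the needle.
    local prefix=${string%%"$needle"*}
    # Offset of the needle itself, not of the string that contains it.
    local needle_offset=$(( offset + ${#prefix} ))
    printf 'sector %d, byte %d\n' \
        $(( needle_offset / sector_size + 1 )) \
        $(( needle_offset % sector_size ))
}

locate_needle 8741383 'GRUB Boot Menu' GRUB   # sector 17074, byte 7
locate_needle 1022 'xxxxGRUB' GRUB            # sector 3, byte 2
```

Adding the prefix length before dividing keeps the byte position below the sector size even when the needle lies in a later sector than the start of its string, as in the second call.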

  2. Lazy all-in-one util rafind2 method. Example, search for the first instance of "GRUB" on /dev/sda7 as before:

    sudo rafind2 -Xs GRUB /dev/sda7 | head -7
    

    Output:

    0x856207
    - offset -   0 1  2 3  4 5  6 7  8 9  A B  C D  E F  0123456789ABCDEF
    0x00856207  4752 5542 2042 6f6f 7420 4d65 6e75 006e  GRUB Boot Menu.n
    0x00856217  6f20 666f 6e74 206c 6f61 6465 6400 6963  o font loaded.ic
    0x00856227  6f6e 732f 0069 636f 6e64 6972 0025 733a  ons/.icondir.%s:
    0x00856237  2564 3a25 6420 6578 7072 6573 7369 6f6e  %d:%d expression
    0x00856247  2065 7870 6563 7465 6420 696e 2074        expected in t 
    

    With some bash and sed that output can be reworked into the same format as the strings output:

    s=GRUB ; sudo rafind2 -Xs "$s" /dev/sda7 | 
    sed -r "s/\x1B\[([0-9]{1,2}(;[0-9]{1,2})?)?[m|K]//g" | 
    sed -r -n 'h;n;n;s/.{52}//;H;n;n;n;n;g;s/\n//p' | 
    while read a b ; do
       printf "String %-10.10s\" found %3i bytes into sector %i\n"  \
              "\"${b}" $((a%512)) $((a/512 + 1)) 
    done | head -5
    

    The first sed instance is borrowed from jfs' answer to "Program that passes STDIN to STDOUT with color codes stripped?", since the rafind2 outputs non-text color codes.

    Output:

    String "GRUB Boot" found   7 bytes into sector 17074
    String "GRUB....L" found  36 bytes into sector 25703
    String "GRUB...LI" found 317 bytes into sector 25873
    String "GRUBLAYO." found 269 bytes into sector 25972
    String "GRUB .Geo" found 392 bytes into sector 26457
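
Two steps of the pipeline above can be tried in isolation. The following is a small sketch assuming GNU sed and bash: `strip_ansi` is a hypothetical wrapper around the first sed command, and the sample input is made up to resemble one colored rafind2 offset line.

```shell
#!/bin/bash
# strip_ansi is a hypothetical wrapper around the first sed command above.
strip_ansi() {
    sed -r "s/\x1B\[([0-9]{1,2}(;[0-9]{1,2})?)?[m|K]//g"
}

# A made-up line: an offset wrapped in color escape sequences.
printf '\033[33m0x856207\033[0m\n' | strip_ansi    # prints 0x856207

# bash arithmetic accepts the 0x prefix, which is why the while loop can use
# the hexadecimal offset directly in $((a%512)) and $((a/512 + 1)).
a=0x856207
echo "$(( a % 512 )) bytes into sector $(( a / 512 + 1 ))"   # 7 bytes into sector 17074
```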
    
agc
  • `array=( $(some-command) )` is bad form -- see [BashPitfalls #50](http://mywiki.wooledge.org/BashPitfalls#hosts.3D.28_.24.28aws_....29_.29). Otherwise, though, this is a good answer. – Charles Duffy Oct 29 '17 at 18:04
  • @CharlesDuffy, Thanks, even without the bad form, that's bad enough since if there's a space, the array has more than two items. – agc Oct 29 '17 at 22:58
  • Let's say our needle is rare enough and we do not end up with lots of matches, but its first occurrence is somewhere in the middle of a 100 GB partition. What is happening under the hood? Does it internally use batches and garbage collect, or do we need an ocean of memory for it to work? – Majid Fouladpour Oct 30 '17 at 04:21
  • @MajidFouladpour, No, this technique should run even on a system with very little memory. All the hard work is done by the util `strings`, it's robust, not a memory hog, and it goes about as fast as can be -- which is maybe about a third of the speed of `dd`. The `grep` part requires the least resources, since maybe only about *1/10* of the average disk is going to be text strings. – agc Oct 30 '17 at 05:02
  • @MajidFouladpour, Did a little testing, a 100GB partition should require under an hour to scan entirely, perhaps less if the HDD is newer. – agc Oct 30 '17 at 05:22
  • OK, I am doing a little test. Here's what I did: ran `echo 'someveryrarestringonlyfoundonceandnomore' > myfile.txt` and then looked up the block number with `sudo hdparm --fibmap myfile.txt` which was `205868656 205868663`. I then ran `sudo strings -t d /dev/sda | grep -m 1 someveryrarestringonlyfoundonceandnomore`. It has been busy for 20 minutes now, but is no memory hog. (I will repeat the test with `time !!`). A question: If `strings` is *a third the speed of `dd`*, why aren't we using `dd`? – Majid Fouladpour Oct 30 '17 at 05:23
  • @MajidFouladpour, `dd` outputs binary data, but `grep` needs text input, so there needs to be something in between. Suggest changing test to `s=someveryrarestringonlyfoundonceandnomore ; sudo strings -t d /dev/sda | grep "$s" | while read a b ; do n=${b%%${s}*}; printf "String %-10.10s found %3i bytes into sector %i\n" "\"${b#${n}}\"" $(( (a % 512) + ${#n} )) $((a/512 + 1)) ; done | head -1` for more readable output. – agc Oct 30 '17 at 05:34
  • Triple performance boost is very desirable. So, let's do a thought experiment. I imagine `strings` is reading binary data off of the block in batches each time incrementing an offset, then it tries to identify what `text` there is in the data, then it compiles a list with sector numbers and the associated texts which is then piped to `grep`. Is it not possible to convert the *needle* to binary data only once and compare against it with `dd`? Even expecting the user to manually do the conversion beforehand would not be asking for too much given the boost. – Majid Fouladpour Oct 30 '17 at 05:45
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/157773/discussion-between-agc-and-majid-fouladpour). – agc Oct 30 '17 at 05:48

Have you thought about something like this?

cat /dev/sdd1 | od -cv | sed s'/[0-9]* \(.*\)/\1/' | tr -d '\n' | sed s'/  F   l/  F   l/'g  > v1
cat /dev/sdd1 | od -cv | sed s'/[0-9]* \(.*\)/\1/' | tr -d '\n' | sed s'/  F   l/x F   l/'g  > v2
cmp -lb v1 v2

for example applying this to a .pdf file

od -cv phase-2-guidance.pdf  | sed s'/[0-9]* \(.*\)/\1/' | tr -d '\n' | sed s'/  F   l/  F   l/'g  > v1
od -cv phase-2-guidance.pdf  | sed s'/[0-9]* \(.*\)/\1/' | tr -d '\n' | sed s'/  F   l/  x   l/'g  > v2
cmp -lb v1 v2

gives the output

   228 106 F    170 x
 23525 106 F    170 x
 37737 106 F    170 x
 48787 106 F    170 x
 52577 106 F    170 x
 56833 106 F    170 x
 57869 106 F    170 x
118322 106 F    170 x
119342 106 F    170 x

The numbers in the first column are the positions in the od output at which the sought pattern starts. These positions are four times the byte offsets in the original file, since od prints four characters for every input byte.
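
The factor of four depends on how od pads its columns, and od implementations may pad the first column of each line differently. A quick check, sketched here with a hypothetical function name, is to count how many characters the od and sed stages emit for one full 16-byte line of input.

```shell
#!/bin/bash
# od_chars_per_line is a hypothetical helper: it feeds 16 known bytes through
# the same od | sed stages as the pipeline above and counts the characters
# that come out for that one line.
od_chars_per_line() {
    printf 'AAAAAAAAAAAAAAAA' |
        od -cv |
        sed 's/[0-9]* \(.*\)/\1/' |
        head -n 1 |
        tr -d '\n' |
        wc -c |
        tr -d ' '
}

echo "characters per 16 input bytes: $(od_chars_per_line)"
```

If this reports 64, dividing a position by four is exact. Any other value means the divisor should be that value divided by 16, otherwise the computed byte offsets drift as the input grows.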

A single line form (in a bash shell), without writing large temporary files, would be

od -cv phase-2-guidance.pdf  | sed s'/[0-9]* \(.*\)/\1/' | tr -d '\n' | sed s'/  F   l/  x   l/'g | cmp -lb - <(od -cv phase-2-guidance.pdf  | sed s'/[0-9]* \(.*\)/\1/' | tr -d '\n' | sed s'/  F   l/  F   l/'g )

this avoids needing to write the contents of /dev/sdd1 to temporary files somewhere.
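
The mechanism behind this trick is easier to see on a toy input. In the sketch below, `match_positions` is a hypothetical helper and the short string stands in for the od output; the needle is assumed to contain no characters that are special to sed.

```shell
#!/bin/bash
# match_positions is a hypothetical helper.
# $1: data to search   $2: needle   $3: needle with its first character changed
match_positions() {
    local data=$1 needle=$2 marked=$3
    # One copy of the data has every match altered by sed; cmp -l then lists
    # each byte position (1-based) where that copy differs from the
    # unmodified copy supplied through process substitution.
    printf '%s\n' "$data" | sed "s/$needle/$marked/g" |
        cmp -l - <(printf '%s\n' "$data") |
        awk '{print $1}'
}

match_positions 'abcFlxyzFl' 'Fl' 'xl'   # prints 4 and 9, one per line
```

Because only the first character of each match is changed, every reported position is the start of one occurrence of the needle.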

Here is an example looking for PDF on a USB drive device and dividing by 4 and 512 to get block numbers

dd if=/dev/disk5s1 bs=512 count=100000 | od -cv | sed s'/[0-9]* \(.*\)/\1/' | tr -d '\n'  | cmp -lb - <(dd if=/dev/disk5s1 bs=512 count=100000 | od -cv | sed s'/[0-9]* \(.*\)/\1/' | tr -d '\n' | sed s'/P   D   F/x   D   F/'g ) | awk '{print int($1/512/4)}' | head -10

testing this gives

100000+0 records in
100000+0 records out
51200000 bytes transferred in 18.784280 secs (2725683 bytes/sec)
100000+0 records in
100000+0 records out
51200000 bytes transferred in 40.915697 secs (1251353 bytes/sec)
cmp: EOF on -
28913
32370
32425
33885
35097
35224
37177
38522
39981
41570

where numbers are 512 byte block numbers. Checking gives

dd if=/dev/disk5s1 bs=512 skip=35224 count=1 | od -vc | grep P

0000340   \0  \0  \0 001   P   D   F       C   A   R   O  \0  \0  \0  \0

Here is a full example on a disk, looking for the character sequence "live" where the characters are separated by NUL bytes:

   dd if=/dev/disk5s1 bs=512 count=100000 | od -cv | sed s'/[0-9]* \(.*\)/\1/' | tr -d '\n' | sed s'/l  \\0   i  \\0   v  \\0   e/x  \\0   i  \\0   v  \\0   e/'g | cmp -lb - <(dd if=/dev/disk5s1 bs=512 count=100000 | od -cv | sed s'/[0-9]* \(.*\)/\1/' | tr -d '\n' | sed s'/l  \\0   i  \\0   v  \\0   e/l  \\0   i  \\0   v  \\0   e/'g )

Note

  • this would not deal with fragmentation into non-consecutive blocks where that splits the pattern. The second sed, which does pattern and substitution, could be replaced by a custom program that does some partial pattern match and makes a substitution if number of matching characters is above some level. That might return false positives, but is probably the only way to deal with fragmentation.
Chris Hill