
I am trying to recover a (MySQL) database from a crashed disk. There are a number of recent dumps, which are corrupted bz2 files. Since the database does not change often, the dumps should be nearly identical. bzip2recover recovered about 70-80% of the chunks from the files, so most if not all of the data could be recovered by finding the overlaps in the files and joining them together. For example:

dump1: |-----------------|xxxxxxxxxxxxxxxx|------------------|
dump2: |-------------|----------------|xxxxxxxxxxxxxxxxxxxxxx|
dump3: |xxxxxxxxxxxxxxxxxxxxxx|---------------|xxxxxxxxxxxxxx|

Here I can detect that the first chunk in dump1 is continued by the second one in dump2, which is continued by the second in dump3, which is continued by the third in dump1. By joining these four chunks, I have recovered the data.

The problem is that there are thousands of files (I have ten dumps of ~400 chunks of ~1 MB each). Is there a tool which could automate this process, or at least parts of it (like a Linux command checking for the longest overlap between the end of one file and the start of another)?

MadHatter
Tgr
  • The first question must be: in a given recovered file, how do you know that a recovered chunk is valid? In dump 3 above, you show the data at the beginning of the dump to be known-good even though you have no other recovered copy that shares that data. How do you know that that data range is valid data? – MadHatter Jan 17 '12 at 08:01
  • @MadHatter: `bzip2recover` does a CRC32 check before extracting the chunks. The image is somewhat misleading: I don't know the exact data ranges. `bzip2` splits the file into roughly (but not exactly) 900K blocks before compression, and `bzip2recover` extracts all those blocks and decompresses the valid ones. So I know the valid data and their rough location, but I don't know the exact length of the invalid blocks (otherwise this would be much simpler). – Tgr Jan 17 '12 at 20:55

4 Answers


I needed exactly the same thing. I came up with this surprisingly fast Python code (it joined two 2 GB files with an 800 MB overlap in 30 seconds). Adjust `overlap_size` as necessary for your chunks: it should be as long as possible, but less than the real overlap size.

#!/usr/bin/env python

import sys

overlap_size = 100000000  # 100 MB; must not exceed the real overlap

# read both chunks as raw bytes
with open(sys.argv[1], 'rb') as f:
    a = f.read()
with open(sys.argv[2], 'rb') as f:
    b = f.read()

# look for the tail of the first chunk inside the second one
end = a[-overlap_size:]
offset = b.find(end)
if offset == -1:
    sys.exit('no overlap found - try a smaller overlap_size')

# write the first chunk minus its tail, then the second chunk from
# the point where that tail starts
with open(sys.argv[3], 'wb') as c:
    c.write(a[:-overlap_size])
    c.write(b[offset:])

Usage:

./join.py chunkA chunkB outputAB
./join.py outputAB chunkC outputABC
./join.py outputABC chunkD outputABCD
...etc
ali1234

I don't have a tool that will do the whole job for you, but you can use tools like:

cmp -l dump1 dump2

This will list the differing bytes and their offsets. The overlap is wherever cmp prints no offsets.

You can also use the dd command to copy part of a dump and append it to another dump.

You can try writing your own script that uses such tools, or a small C program that compares the files and copies the needed parts.

I hope you find those ideas helpful.

Khaled

"like a Linux command checking for the longest overlap between the end of one file and the start of another"

Traditionally, this would be diff. It will produce the "difference" of two given text files as the output, along with some control information (what has been added, what has been removed, which lines to check). The patch command is able to reverse the process.

In theory, you should be able to use diff on your different chunks, work a bit on its output (like removing the commands for line deletion) and feed it to patch:

# echo 'this
> is
> a' > file1
# echo 'a
> chunked' > file2
# echo 'chunked
> data
> file' > file3

# diff file2 file1 | egrep -v '^>' | patch -p0 -R file1 -o file12
patching file file1

# cat file12
this
is
a
chunked

# diff file3 file12 | egrep -v '^>' | patch -p0 -R file12 -o -
patching file file12
this
is
a
chunked
data
file
#

Note that if you have very large input files, diff will need a vast amount of memory.

the-wabbit
  • Memory is one of the problems; the other is that `diff` is based on lines. MySQL dumps have very long lines, and the chunks are not split at newlines. – Tgr Jan 17 '12 at 21:01
  • @Tgr Whether you have long lines basically depends on the mysqldump options used when creating the dump. But since you do have them, diff would indeed not be appropriate. You could take a look at `rdiff` as well - it is based on rolling checksums instead of lines, but as far as I can see it cannot easily be scripted, so you would probably have to dig into the source code to make it work for your purposes. – the-wabbit Jan 18 '12 at 08:42

I think you're just going to have to write such a tool yourself.

Start out with the largest file and copy it into memory as your image.

Then run through all the files, one by one, looking for an overlap with either the first or last chunk of the current memory image. If you find an overlap, extend the memory image.

Repeat until you take a pass through all the files without adding any bytes. Then write the memory image to a file.
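
A minimal Python sketch of that loop (untested, assuming all recovered chunks fit in memory) could look like the following; the script interface, the `join` helper and the `probe` length are placeholders to adapt, and `probe` should be shorter than the smallest real overlap but long enough to rule out accidental matches:

#!/usr/bin/env python
# Sketch: greedily grow an in-memory image by attaching chunks whose start
# overlaps the image's end, or whose end overlaps the image's start.
import sys

probe = 100000  # bytes taken from the edge of the image; tune for your data
                # (assumes every chunk is longer than this)

def join(left, right):
    """Return left joined with right if the end of left overlaps the
    start of right, otherwise None."""
    pos = right.find(left[-probe:])
    if pos == -1:
        return None
    # left already contains right[:pos + probe], so append only the rest
    return left + right[pos + probe:]

chunks = [open(name, 'rb').read() for name in sys.argv[1:-1]]
chunks.sort(key=len, reverse=True)
image = chunks.pop(0)              # start from the largest chunk

grew = True
while grew and chunks:
    grew = False
    for i, chunk in enumerate(chunks):
        joined = join(image, chunk) or join(chunk, image)
        if joined is not None:
            image = joined
            del chunks[i]
            grew = True
            break                  # restart the pass with the larger image

with open(sys.argv[-1], 'wb') as out:
    out.write(image)

A hypothetical invocation would be ./assemble.py chunk1 chunk2 ... chunkN output. Note that this keeps everything in memory at once, so with hundreds of chunks it is only a starting point.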

David Schwartz