
I have a binary file which can be seen as a concatenation of different sub-files:

INPUT FILE:

Hex Offset     ID           SortIndex
0000000        SubFile#1    3
0000AAA        SubFile#2    1
0000BBB        SubFile#3    2
...
FFFFFFF        SubFile#N    N

This is the information I have about each SubFile:

  • Starting Offset
  • Length in bytes
  • Final sequence Order

What's the fastest way to produce a sorted output file, in your opinion?

For instance, the OUTPUT FILE will contain the SubFiles in the following order:

SubFile#2    
SubFile#3    
SubFile#1    
...

I have thought about:

  • Splitting the input file, extracting each SubFile to disk, then concatenating them in the correct order
  • Using file seeks to move around the input file, adding each SubFile to a BinaryWriter stream.

Consider the following information also:

  • The input file can be really huge (200 MB~1 GB)
  • For those who know, I am speaking about IBM AFP files.

Both of my solutions are easy to implement, but they look really inefficient to me.

Thanks in advance

Duncan_McCloud
  • I have edited the question to make it clearer, I hope. I just want to sort the SubFiles by the SortIndex column, for instance (the sorting criterion is not really important, because I am able to identify each SubFile, so I can apply any sorting criterion I want, like reverse order for example) – Duncan_McCloud Oct 29 '11 at 16:04

2 Answers


Even if the file is big, the number of IDs is not so huge.

You can just load all your IDs, sort indexes, offsets and lengths into RAM, sort them in RAM with a simple quicksort, and when you finish, rewrite the entire file in the order given by your sorted array. I expect this to be faster than the other methods. So... let's make some pseudocode.

public struct FileItem : IComparable<FileItem>
{
    public string Id;
    public int SortIndex;
    public uint Offset;
    public uint Length;

    public int CompareTo(FileItem other) { return this.SortIndex.CompareTo(other.SortIndex); }
}

public static FileItem[] LoadAndSortFileItems(string inputFile)
{
    FileItem[] result = // fill the array with the Id, SortIndex, Offset and Length of each SubFile

    Array.Sort(result);
    return result;
}

public static void WriteFileItems(FileItem[] items, string inputFile, string outputFile)
{
    using (FileStream input = File.OpenRead(inputFile))
    using (FileStream output = File.Create(outputFile))
    {
        foreach (FileItem item in items)
        {
            // Seek to the SubFile and copy item.Length bytes
            // from inputFile[item.Offset .. item.Offset + item.Length] to outputFile
            // (see the block copy helper further down).
            input.Seek(item.Offset, SeekOrigin.Begin);
        }
    }
}

The number of read operations is linear, O(n), but seeking is required. The only performance problem with seeking is missing the hard drive cache. Modern hard drives have a big cache, from 8 to 32 megabytes; seeking through a big file in random order means cache misses, but I would not worry too much, because the amount of time spent copying the files is, I guess, greater than the amount of time required by the seeks.

If you are using a solid state disk instead, seek time is 0 :)

Writing the output file, however, is O(n) and sequential, and this is a very good thing since you will be totally cache friendly. You can ensure a better time if you preallocate the size of the file before starting to write it.

 FileStream myFileStream = ...
 myFileStream.SetLength(predictedTotalSizeOfFile);

Sorting the FileItem structures in RAM is O(n log n), but even with 100,000 items it will be fast and will use a small amount of memory.

The copy is the slowest part. Use a 256 kilobyte .. 2 megabyte buffer for the block copy, to ensure that copying big chunks of file A to file B is fast; you can adjust the block copy buffer size by doing some tests, always keeping in mind that every machine is different. A sketch of such a helper follows.
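
A minimal sketch of that block copy, assuming using System and System.IO directives; the 1 megabyte default is only a starting point to tune:

// Rough sketch of a block copy with a tunable buffer size.
// Benchmark buffer sizes between 256 kilobytes and 2 megabytes on your own machine.
static void CopyBlock(Stream input, Stream output, long count, int bufferSize = 1024 * 1024)
{
    byte[] buffer = new byte[bufferSize];
    while (count > 0)
    {
        int toRead = (int)Math.Min(buffer.Length, count);
        int read = input.Read(buffer, 0, toRead);
        if (read == 0)
            throw new EndOfStreamException("Unexpected end of input file.");
        output.Write(buffer, 0, read);
        count -= read;
    }
}

In the WriteFileItems loop above, you would call CopyBlock(input, output, item.Length) right after the Seek.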

It is not useful to try a multithreaded approach; it will just slow down the copy.

It is obvious, but if you copy from drive C: to drive D:, for example, it will be faster (of course, not two partitions but two different Serial ATA drives).

Consider also that you need to seek, either while reading or while writing; at some point, you will need to seek. Also, if you split the original file into several smaller files, you will make the OS seek among the smaller files, and this doesn't make sense: it will be messy, slower, and probably more difficult to code. Consider also that if the files are fragmented, the OS will seek by itself, and that is out of your control.

Salvatore Previti

The first solution I thought of was to read the input file sequentially and build a SubFile object for every sub-file. These objects would be put into a b+tree as soon as they are created. The tree would order the sub-files by their SortIndex. A good b-tree implementation has linked leaf nodes, which enables you to iterate over the sub-files in the correct order and write them into the output file. A rough sketch follows.
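
A rough sketch of this idea in C#, using SortedDictionary in place of a real b+tree (the SubFile type, ReadSubFilesSequentially and its Data property are hypothetical names; it also assumes SortIndex values are unique, the sub-files fit in memory, and using System.IO and System.Collections.Generic directives):

// SortedDictionary is a balanced binary tree, not a b+tree, but its Values
// collection enumerates in key order, which is all this approach needs.
var bySortIndex = new SortedDictionary<int, SubFile>();

foreach (SubFile sub in ReadSubFilesSequentially(inputPath))   // hypothetical sequential reader
    bySortIndex.Add(sub.SortIndex, sub);

using (FileStream output = File.Create(outputPath))
{
    foreach (SubFile sub in bySortIndex.Values)                // iterates in SortIndex order
        output.Write(sub.Data, 0, sub.Data.Length);
}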

Another way could be to use random access files. You can load all the SortIndexes and offsets, then sort them and write the output file in that sorted order. In this case everything depends on the random access file reader implementation: if it just reads the file up to a specified position, it would not be very performant. Honestly, I have no idea how they work... :(
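
For what it's worth, in .NET a FileStream already behaves as a random access file: Seek repositions the stream directly rather than reading up to the target offset. A minimal sketch, assuming inputPath, outputPath and a sortedIndex of entries (Offset, Length) already ordered by SortIndex, plus using System and System.IO directives:

byte[] buffer = new byte[1024 * 1024];
using (FileStream input = File.OpenRead(inputPath))
using (FileStream output = File.Create(outputPath))
{
    foreach (var entry in sortedIndex)              // entries already sorted by SortIndex
    {
        input.Seek(entry.Offset, SeekOrigin.Begin); // jumps directly, no sequential scan
        long remaining = entry.Length;
        while (remaining > 0)
        {
            int read = input.Read(buffer, 0, (int)Math.Min(buffer.Length, remaining));
            if (read == 0) break;
            output.Write(buffer, 0, read);
            remaining -= read;
        }
    }
}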

hage
  • If I got it right, all SubFile objects will be stored in memory. Let's make a practical example: 10 SubFiles (each 100 MB) --> 1 GB of memory taken. Isn't this too much? – Duncan_McCloud Oct 29 '11 at 16:09
  • There is also the possibility of using a disk-based tree which saves the nodes and leaves on the HDD... it all depends on the quality of the b-tree implementation... – hage Oct 29 '11 at 16:13