The fastest search algorithm for binary files?

Question

I'm looking for the fastest algorithm to search for some values in a very large binary file (an AFP file of about 2 GB), which means that loading the whole file into memory is out of the question. I'm working in C#, and I don't know whether another programming language (C, C++, ...) would be significantly faster; if not, I'll stay with C#. Thanks for any ideas.

What do you mean by _"values"_? A byte, a byte[], a string, what? — Marco, Dec 01 '11 at 09:21
You say you need to search for bytes, but what's your goal? To count how many of those bytes you have? To scan the file until you find a target byte so you can read some variable from there? Please explain more clearly what you are trying to do. — Marco, Dec 01 '11 at 09:33
I have to count the occurrences of certain byte sequences in the file (for example, X'D3A8AF'). — Abdelilah Outifaout, Dec 01 '11 at 09:38
This sequence marks the beginning of a page, so counting it gives the number of pages in the file. — Abdelilah Outifaout, Dec 01 '11 at 09:42

score 2 · Answer 1 · answered Dec 01 '11 at 10:14

Boyer-Moore offers a good compromise between performance and complexity (and the linked article has links to other methods).

An implementation in C (source code in link) will be significantly faster than C#, although in practice you'll probably find that disk I/O is the biggest hurdle.
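
A minimal sketch of the idea in C#, using the Horspool simplification of Boyer-Moore (bad-character table only) rather than the full algorithm. The names `HorspoolSearch` and `Count` are illustrative and are not taken from the linked article; the sketch only searches a block that is already in memory.

```csharp
using System;

public static class HorspoolSearch
{
    // Counts occurrences (overlapping ones included) of pattern
    // in data[0 .. length).
    public static int Count(byte[] data, int length, byte[] pattern)
    {
        int m = pattern.Length;
        if (m == 0 || length < m) return 0;

        // Bad-character table: how far the window may move, keyed by the
        // byte currently aligned with the last pattern position.
        int[] shift = new int[256];
        for (int i = 0; i < 256; i++) shift[i] = m;
        for (int i = 0; i < m - 1; i++) shift[pattern[i]] = m - 1 - i;

        int count = 0;
        int pos = 0;
        while (pos <= length - m)
        {
            // Compare the window with the pattern from right to left.
            int j = m - 1;
            while (j >= 0 && data[pos + j] == pattern[j]) j--;
            if (j < 0) count++;

            // Skip ahead according to the last byte of the window.
            pos += shift[data[pos + m - 1]];
        }
        return count;
    }
}
```

For the page marker from the question, `HorspoolSearch.Count(block, block.Length, new byte[] { 0xD3, 0xA8, 0xAF })` would count the markers inside one in-memory block. With a three-byte pattern the skips are short, so the gain over a plain scan is likely modest and, as noted above, disk I/O may dominate anyway.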

Marco · Answer 2 · 2011-12-01T10:18:48.190

1

After commenting, I decided to provide a possible solution.
Be careful: this solution is neither the best nor the most elegant.
Use it as a starting point:

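using System;
using System.IO;

// Text to look for. StreamReader decodes the file as text, so this matches
// the literal characters, not the raw bytes D3 A8 AF.
string SEARCH = @"X'D3A8AF";
int BUFFER = 1024;   // block size in chars; must be larger than SEARCH.Length

int tot = 0;
// filename: path of the file to scan (defined elsewhere)
using (FileStream fs = new FileStream(filename, FileMode.Open, FileAccess.Read))
using (StreamReader sr = new StreamReader(fs))
{
    char[] buffer = new char[BUFFER];
    int keep = 0;   // chars carried over from the previous block
    int read;
    while ((read = sr.ReadBlock(buffer, keep, BUFFER - keep)) > 0)
    {
        string s = new string(buffer, 0, keep + read);
        int i = 0;
        // Count every occurrence in the current block.
        while ((i = s.IndexOf(SEARCH, i, StringComparison.Ordinal)) >= 0)
        {
            tot++;
            i++;
        }
        // Carry the last SEARCH.Length - 1 chars into the next block so a
        // match split across two blocks is found, but none is counted twice.
        keep = Math.Min(SEARCH.Length - 1, s.Length);
        s.CopyTo(s.Length - keep, buffer, 0, keep);
    }
}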

BUFFER can be increased as needed.

edited Dec 01 '11 at 10:18

answered Dec 01 '11 at 10:13

Marco


@Outifaout: let me know if it works for you. – Marco Dec 01 '11 at 10:24
Basically it doesn't work: the string s holds binary data, which can't be read and indexed! – Abdelilah Outifaout Dec 01 '11 at 10:39
@Outifaout: why? Any char[] can be converted to string! – Marco Dec 01 '11 at 10:40
That char[] itself contains unreadable characters (something like @ùøñõò..) – Abdelilah Outifaout Dec 01 '11 at 10:45
What I'm doing right now is reading it in hexadecimal format, byte by byte: int b = file.ReadByte(); string s = b.ToString("X"); My problem is that I need an optimal approach in terms of processing time! – Abdelilah Outifaout Dec 01 '11 at 10:46
@Outifaout: I provided an example... I can't develop all the code for you ;) – Marco Dec 01 '11 at 10:50
Thank you very much Marco, I got the method. I'll try to fix this reading problem. – Abdelilah Outifaout Dec 01 '11 at 10:54
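
A minimal sketch of the byte-based comparison these comments point toward, assuming the pattern is supplied as a hex string such as "D3A8AF". `HexPattern`, `Parse`, and `MatchesAt` are illustrative names, not part of the answer above. One detail worth knowing: `b.ToString("X")` formats values below 0x10 as a single digit, so concatenated hex strings can misalign; comparing raw bytes avoids the conversion entirely.

```csharp
using System;
using System.Globalization;

public static class HexPattern
{
    // Converts a hex string such as "D3A8AF" into { 0xD3, 0xA8, 0xAF }.
    public static byte[] Parse(string hex)
    {
        if (hex.Length % 2 != 0)
            throw new ArgumentException("Hex string needs an even number of digits.", nameof(hex));

        byte[] bytes = new byte[hex.Length / 2];
        for (int i = 0; i < bytes.Length; i++)
        {
            bytes[i] = byte.Parse(hex.Substring(2 * i, 2),
                                  NumberStyles.HexNumber,
                                  CultureInfo.InvariantCulture);
        }
        return bytes;
    }

    // True when pattern occurs in data starting exactly at offset.
    public static bool MatchesAt(byte[] data, int offset, byte[] pattern)
    {
        if (offset < 0 || offset + pattern.Length > data.Length) return false;
        for (int j = 0; j < pattern.Length; j++)
        {
            if (data[offset + j] != pattern[j]) return false;
        }
        return true;
    }
}
```

The pattern is parsed once, up front, and every later comparison works on bytes read in blocks, which avoids allocating one string per byte.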

score 0 · Answer 3 · answered Dec 01 '11 at 09:21

You have to load the entire file to search for the object. If possible, split the file based on unique IDs, if you have them: for example, one file per 100 records (1-100, 101-200, 201-300, etc.), keyed on the unique IDs or some other parameter. It is a kind of indexing for your binary file.

answered Dec 01 '11 at 09:21

hungryMind


No, he can't load the entire file, IMHO! OP could use a StreamReader and read the file in chunks. It depends on what he's searching for. – Marco Dec 01 '11 at 09:22

Ilan Huberman · Answer 4 · 2011-12-01T09:45:35.067

0

You can use the TextReader.ReadBlock method: read the file block by block and look for the requested values. Even better, use the BinaryReader.ReadBytes method.

edited Dec 01 '11 at 09:45

answered Dec 01 '11 at 09:27

Ilan Huberman


Values mean bytes. I'm working on 32-bit Windows. I don't think binary files are structured in lines, so StreamReader.ReadLine doesn't apply! – Abdelilah Outifaout Dec 01 '11 at 09:30
It's a binary file, so it can't have _"lines"_ IMO. – Marco Dec 01 '11 at 09:30
@Outifaout I've edited the post and removed the StreamReader.ReadLine method. – Ilan Huberman Dec 01 '11 at 09:36
The problem with blocks is the risk of missing some occurrences. For example, I'm searching for "D3A8AF" and one block ends with "D3" while the next one starts with "A8AF"! – Abdelilah Outifaout Dec 01 '11 at 09:45
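
One common way around the block-boundary problem raised in the last comment is to carry the final bytes of each block over into the next one. The sketch below assumes a plain `Stream` and a naive in-block comparison; `ChunkedCounter`, `Count`, and the 1 MiB default block size are illustrative choices, not something prescribed in this thread.

```csharp
using System;
using System.IO;

public static class ChunkedCounter
{
    // Counts occurrences of pattern in input, reading at most
    // bufferSize bytes per call so the file never sits in memory whole.
    public static long Count(Stream input, byte[] pattern, int bufferSize = 1 << 20)
    {
        int m = pattern.Length;
        if (m == 0) return 0;
        if (bufferSize < m) bufferSize = m;

        // Room for one block plus the tail carried from the previous one.
        byte[] buffer = new byte[bufferSize + m - 1];
        int carried = 0;
        long count = 0;
        int read;
        while ((read = input.Read(buffer, carried, bufferSize)) > 0)
        {
            int valid = carried + read;
            for (int i = 0; i + m <= valid; i++)
            {
                int j = 0;
                while (j < m && buffer[i + j] == pattern[j]) j++;
                if (j == m) count++;
            }

            // Keep the last m - 1 bytes: too short to contain a whole match
            // (so nothing is counted twice), long enough to complete a match
            // that is split across two blocks.
            carried = Math.Min(m - 1, valid);
            Array.Copy(buffer, valid - carried, buffer, 0, carried);
        }
        return count;
    }
}
```

With a file, this could be called as `ChunkedCounter.Count(File.OpenRead(path), new byte[] { 0xD3, 0xA8, 0xAF })`, where `path` is a placeholder for the AFP file and the stream should be disposed by the caller. Memory use stays at roughly one block regardless of file size.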
