4

I have a FindFile routine in my program which will list files, but if the "Containing Text" field is filled in, then it should only list files containing that text.


If the "Containing Text" field is entered, then I search each file found for the text. My current method of doing that is:

  var
    FileContents: TStringList;

  begin
    FileContents := TStringList.Create;  // the list must be created before use
    try
      FileContents.LoadFromFile(Filepath);
      Found := Pos(TextToFind, FileContents.Text) > 0;  // true if the text occurs anywhere
    finally
      FileContents.Free;
    end;

The above code is simple, and it generally works okay. But it has two problems:

  1. It fails for very large files (e.g. 300 MB)

  2. I feel it could be faster. It isn't bad, but why wait 10 minutes searching through 1000 files, if there might be a simple way to speed it up a bit?

I need this to work for Delphi 2009 and to search text files that may or may not be Unicode. It only needs to work for text files.

So how can I speed this search up and also make it work for very large files?


Bonus: I would also want to allow an "ignore case" option. That's a tougher one to make efficient. Any ideas?


Solution:

Well, mghie pointed out my earlier question, How Can I Efficiently Read The First Few Lines of Many Files in Delphi, and as I answered there, it was a different question and didn't provide the solution.

But he got me thinking that I had done this before, and I had. I built a block reading routine for large files that breaks them into 32 MB blocks. I use that to read the input file of my program, which can be huge. The routine works fine and fast. So step one is to do the same for these files I am looking through.

So now the question was how to efficiently search within those blocks. Well, I did have a previous question on that topic: Is There An Efficient Whole Word Search Function in Delphi? There, RRUZ pointed out the SearchBuf routine to me.

That solves the "bonus" as well, because SearchBuf has options which include Whole Word Search (the answer to that question) and MatchCase/noMatchCase (the answer to the bonus).
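
Just to make the shape of the solution concrete, here is a minimal sketch of the idea. SearchBuf (with its soWholeWord/soMatchCase options) and TFileStream are the real RTL pieces; the FileContainsText helper name, the single-byte encoding handling and the block-overlap details are illustrative assumptions, not my actual block reader:

  uses
    Classes, SysUtils, StrUtils;

  function FileContainsText(const FilePath, TextToFind: string;
    MatchCase: Boolean): Boolean;
  const
    BlockSize = 32 * 1024 * 1024;  // 32 MB blocks, as in the block reader described above
  var
    Stream: TFileStream;
    Bytes: TBytes;
    Block, Carry: string;
    Options: TStringSearchOptions;
    BytesRead, TailLen: Integer;
  begin
    Result := False;
    Options := [soDown];              // add soWholeWord here for a whole word search
    if MatchCase then
      Include(Options, soMatchCase);  // omit soMatchCase for the "ignore case" bonus
    Stream := TFileStream.Create(FilePath, fmOpenRead or fmShareDenyWrite);
    try
      SetLength(Bytes, BlockSize);
      Carry := '';
      repeat
        BytesRead := Stream.Read(Bytes[0], BlockSize);
        if BytesRead <= 0 then
          Break;
        // Assumes single-byte ANSI text; real code would detect the encoding per file.
        Block := Carry + TEncoding.Default.GetString(Bytes, 0, BytesRead);
        if SearchBuf(PChar(Block), Length(Block), 0, 0, TextToFind, Options) <> nil then
          Exit(True);   // stop at the first match instead of scanning the whole file
        // Keep the last Length(TextToFind)-1 characters so a match that
        // straddles two blocks is not missed.
        TailLen := Length(TextToFind) - 1;
        if TailLen > Length(Block) then
          TailLen := Length(Block);
        Carry := Copy(Block, Length(Block) - TailLen + 1, TailLen);
      until BytesRead < BlockSize;
    finally
      Stream.Free;
    end;
  end;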

So I'm off and running. Thanks once again SO community.

lkessler
  • Have you checked to see if you can access the Windows Search feature programmatically? It has already indexed a lot of your user's hard drive, perhaps. – Warren P Feb 17 '11 at 15:02
  • @Warren: Interesting idea. Probably fraught with problems, but interesting nonetheless. – lkessler Feb 17 '11 at 19:04

6 Answers

12

The best approach here is probably to use memory mapped files.

First you need a file handle, use the CreateFile windows API function for that.

Then pass that to CreateFileMapping to get a file mapping handle. Finally use MapViewOfFile to map the file into memory.

To handle large files, MapViewOfFile is able to map only a certain range into memory, so you can e.g. map the first 32MB, then use UnmapViewOfFile to unmap it followed by a MapViewOfFile for the next 32MB and so on. (EDIT: as was pointed out below, make sure that the blocks you map this way overlap by a multiple of 4kb, and at least as much as the length of the text you are searching for, so that you are not overlooking any text which might be split at the block boundary)

To do the actual searching once (part of) the file is mapped into memory, you can make a copy of the source for StrPosLen from SysUtils.pas (it's unfortunately defined in the implementation section only and not exposed in the interface). Leave one copy as is and make another copy, replacing Wide with Ansi every time. Also, if you want to be able to search in binary files which might contain embedded #0's, you can remove the "(Str1[I] <> #0) and" part.

Either find a way to identify whether a file is ANSI or Unicode, or simply call both the Ansi and the Unicode version on each mapped part of the file.

Once you are done with each file, make sure to first call UnmapViewOfFile, then CloseHandle on the file mapping handle, and finally CloseHandle on the file handle.
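
For reference, a minimal sketch of that mapping loop might look like the following (error handling and the actual text search are omitted; the ScanFileMapped name, the 32 MB block size and the overlap computation are illustrative assumptions):

  uses
    Windows, SysUtils;

  procedure ScanFileMapped(const FileName: string; const Needle: AnsiString);
  const
    BlockSize = 32 * 1024 * 1024;  // a multiple of the 64 KB allocation granularity
  var
    hFile, hMapping: THandle;
    FileSize, Offset: Int64;
    ViewSize: Cardinal;
    View: Pointer;
  begin
    hFile := CreateFile(PChar(FileName), GENERIC_READ, FILE_SHARE_READ, nil,
      OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, 0);
    if hFile = INVALID_HANDLE_VALUE then
      RaiseLastOSError;
    try
      Int64Rec(FileSize).Lo := GetFileSize(hFile, @Int64Rec(FileSize).Hi);
      hMapping := CreateFileMapping(hFile, nil, PAGE_READONLY, 0, 0, nil);
      if hMapping = 0 then
        RaiseLastOSError;
      try
        Offset := 0;
        while Offset < FileSize do
        begin
          if FileSize - Offset > BlockSize then
            ViewSize := BlockSize
          else
            ViewSize := Cardinal(FileSize - Offset);
          View := MapViewOfFile(hMapping, FILE_MAP_READ,
            Int64Rec(Offset).Hi, Int64Rec(Offset).Lo, ViewSize);
          if View = nil then
            RaiseLastOSError;
          try
            // ... search for Needle in the ViewSize bytes at View here ...
          finally
            UnmapViewOfFile(View);
          end;
          // Step forward, overlapping by at least Length(Needle) bytes rounded up
          // to 64 KB, so a match straddling two views is not missed and the next
          // offset stays aligned (assumes the needle is much smaller than a block).
          Inc(Offset, BlockSize - ((Length(Needle) + $FFFF) and not $FFFF));
        end;
      finally
        CloseHandle(hMapping);
      end;
    finally
      CloseHandle(hFile);
    end;
  end;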

EDIT:

A big advantage of using memory mapped files instead of using e.g. a TFileStream to read the file into memory in blocks is that the bytes will only end up in memory once.

Normally, on file access, Windows first reads the bytes into the OS file cache and then copies them from there into the application's memory.

If you use memory mapped files, the OS can directly map the physical pages from the OS file cache into the address space of the application without making another copy (reducing the time needed for the copy and halving memory usage).

Bonus Answer: By calling StrLIComp instead of StrLComp you can do a case insensitive search.

Thorsten Engler
  • Sounds interesting. Have you seen example code that does this? – lkessler Feb 16 '11 at 05:33
  • I have used memory mapped files quite often for different purposes. Not specifically to search for text in the way you want to, but what you specifically do with the bytes that are mapped into memory doesn't really matter. (I've edited my answer above to add more information.) – Thorsten Engler Feb 16 '11 at 05:44
  • Isn't a memory mapped file simply a convenience? Doesn't it have the same performance characteristic as reading into a buffer of the same size as the view window? – David Heffernan Feb 16 '11 at 06:51
  • IMHO this is the best approach, a few up-votes would NOT hurt your fingers –  Feb 16 '11 at 07:29
  • @David, as I've explained in my edit above (which I did before your comment), it is more than just a convenience: using memory mapped files saves the OS from having to copy the bytes from the OS file cache into a page allocated by the application; it instead just maps the same physical page into the application address space (as long as you keep your offsets aligned to a multiple of at least 4 KB, better 64 KB). Also, it will make loading/scanning run in parallel. The OS only loads in pages when required, but it detects a sequential access pattern and starts reading ahead in a background thread. – Thorsten Engler Feb 16 '11 at 07:47
  • @Thorsten OK I see. Sorry I didn't read all the detail of your post first time round! – David Heffernan Feb 16 '11 at 07:49
  • You will also have to handle the edge case where the search string "straddles" the blocks. Easiest way would be to start the next block Length(s) before the end of the current block. – Gerry Coll Feb 16 '11 at 07:55
  • Memory mapped files might also fail for large files because Windows needs to find a contiguous block of free address space. Since a Delphi app is a 32-bit app and most likely doesn't use the "3G" flag, it only has 2 GB of address space available to it. Some of that is already taken. Mapping a 2 GB file would be impossible; I have no idea if 1 GB is possible. If you need to support large files, memory mapped files are not the way to go. – Cosmin Prund Feb 16 '11 at 08:30
  • @Cosmin, that's the whole point of mapping the file in smaller sections; please actually read my answer instead of jumping from the first line right to posting a comment. Thanks. – Thorsten Engler Feb 16 '11 at 09:44
  • @Gerry, yes, you are right, if the file is larger than whatever the maximum size that is mapped at one time, then the following mappings should overlap the previous one by a multiple of 4kb. – Thorsten Engler Feb 16 '11 at 09:45
  • With memory mapped files chunked in small pieces (like 32 MB), you'll miss some text content if it is split between two 32 MB buffers. The implementation has to take care of that! In practice, a 32 MB memory mapped file won't make it faster than a plain TFileStream read into a fixed 32 MB buffer, if you have to go through all the file content from beginning to end. The memory copy you spoke about from the file cache won't make any big difference: the bottleneck will be in StrILComp(), not in this memory copy. – Arnaud Bouchez Feb 16 '11 at 09:57
  • @A.Bouchez, once the memory mapping code has been implemented, which is only insignificantly more complex than using a TFileStream and a buffer that you are reading into (where you ALSO have to take care of handling the case where your search text is split), it is then always possible to improve on the search performance by using something more efficient than StrILComp. Usage of memory mapped files is very simple to implement and is without doubt more efficient than using a TFileStream. – Thorsten Engler Feb 16 '11 at 11:49
  • I'm using memory mapped files for fast random access of data. But for reading huge files from beginning to end, it's perhaps not worth it. And the FASTEST approach won't be memory mapped files, but a full text index: requests will be immediate. You only need to parse the text once to create the index. It's like trying to learn to read faster to find some text in a book, when just using the index at the end of the book will lead you right to the page. :) – Arnaud Bouchez Feb 16 '11 at 16:15
  • @A.Bouchez. FWIW, I was able to write some (messy) code to test this basic idea fairly quickly. I think the big advantage is that the OS will automatically start reading ahead in the file. – Gerry Coll Feb 16 '11 at 20:14
  • @Gerry I think the OS will also be smart enough to read the file content ahead, even for direct file access. All modern file systems do this (at least on Linux, where the internals are documented and the source code is available). I'm quite sure the NTFS driver does the same. We are not on MSDOS any more, which called the BIOS disk functions more or less directly when you wanted to read some data (there was some buffering even in those glorious times, AFAIR)... :) – Arnaud Bouchez Feb 17 '11 at 06:21
  • @Thorsten Take a wall clock, write a program to read 2 GB of files from beginning to end using a TFileStream or a memory mapped file, and you'll see. I've done this, and if your data doesn't fit in memory, it's not faster. The bottleneck is definitely the HD access. A full text index will make any search immediate, whatever the file sizes are. Using memory mapped files won't help much. Google doesn't read all its file content when you make a search; you'd have to wait for days before getting an answer (even with memory mapped files). They use indexes. That was my point. – Arnaud Bouchez Feb 17 '11 at 06:25
3

If you are looking for text string searches, look at the Boyer-Moore search algorithm. The implementation I use is built on memory mapped files and a really fast search engine. There are some Delphi units around that contain implementations of this algorithm.

To give you an idea of the speed: I currently search through 10-20 MB files and it takes on the order of milliseconds.

Oh, I just read that it might be Unicode - not sure if it supports that - but definitely look down this path.
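
For illustration only, a bare-bones Boyer-Moore-Horspool variant over a byte buffer might look like the sketch below (this is not the code from any of the units mentioned above; it returns the offset of the first match or -1):

  uses
    SysUtils;  // for TBytes

  function BMHFind(const Haystack, Needle: TBytes): Integer;
  var
    Skip: array[Byte] of Integer;
    I, J, NLen, HLen: Integer;
  begin
    Result := -1;
    NLen := Length(Needle);
    HLen := Length(Haystack);
    if (NLen = 0) or (HLen < NLen) then
      Exit;
    // Bad-character table: how far we may safely shift on a mismatch.
    for I := 0 to 255 do
      Skip[I] := NLen;
    for I := 0 to NLen - 2 do
      Skip[Needle[I]] := NLen - 1 - I;
    I := 0;
    while I <= HLen - NLen do
    begin
      J := NLen - 1;
      while (J >= 0) and (Haystack[I + J] = Needle[J]) do
        Dec(J);
      if J < 0 then
        Exit(I);  // full match at offset I
      Inc(I, Skip[Haystack[I + NLen - 1]]);
    end;
  end;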

Simon
  • Searching Unicode isn't hard. You can still do it at the byte level trivially for most encodings. It takes a little extra care for UTF-8, I suspect. – David Heffernan Feb 16 '11 at 06:53
  • To get the source component: http://cc.embarcadero.com/Item/12452, it also supports case sensitive and insensitive searches - (20MB file in around 100 milliseconds) – Simon Feb 16 '11 at 15:13
  • that would probably be a great component if it were updated to Delphi 2009. It's only up to Delphi 7 so it's likely not Unicode enabled. Has someone updated it somewhere? – lkessler Feb 16 '11 at 19:40
  • I know, unfortunately you will need to modify the source to handle Unicode strings. I'm still on D7 so I don't know exactly what's involved - does it just require changing string to UnicodeString (I'm guessing that's too easy to be true)? If you look at the code it will give you an idea of the methodology though. – Simon Feb 17 '11 at 00:52
2

This is a problem connected with your previous question How Can I Efficiently Read The First Few Lines of Many Files in Delphi, and the same answers apply. If you don't read the files completely but in blocks, then large files won't pose a problem. There's also a big speed-up to be had for files containing the text, in that you should cancel the search upon the first match. Currently you read the whole file even when the text to be found is in the first few lines.

mghie
  • And you could use Boyer-Moore searching if you were particularly concerned, but it's only slightly faster compared to not loading more from disk than you have to. If you were counting matches in an already loaded file it'd be very handy. – Feb 16 '11 at 05:17
  • That question was about reading the first few lines of each file. This one is different because it potentially requires reading and scanning the entire file. Yes, I guess I could code the blocking fairly easily - but then do I still do the Pos() function for each block? Also I'd have to experiment to estimate the best block size and figure out how to best overlap the blocks so that it will catch text that crosses a block boundary. I was really hoping it wouldn't be that involved, and that someone had another solution that would be quicker and easier to implement. – lkessler Feb 16 '11 at 05:25
  • @lkessler: Exactly, it requires only *potentially* to read everything. Your text entry field for the search text looks like it was for phrases that don't span multiple lines. The answers to your other question show how to efficiently read large files line by line. Whether you use `Pos()` to search in each line or something faster and / or with more features (like case insensitive search) is another question. – mghie Feb 16 '11 at 07:00
2

May I suggest a component? If yes, I would recommend ATStreamSearch. It handles ANSI and Unicode (and even EBCDIC and Korean and more).

Or the class TUTBMSearch from JclUnicode (Jedi-JCL). It was mainly written by Mike Lischke (of VirtualTreeview). It uses a tuned Boyer-Moore algorithm that ensures speed. The bad point in your case is that it works entirely in Unicode (WideStrings), so the conversion from String to WideString risks being a penalty.

FileVoyager
  • +1 A good set of components here that I haven't seen before. The stream search routine does blocked reads, indicates progress, and allows case options and regex searches. However, it is not designed for speed and searches at only about 1 MB per second. – lkessler Feb 16 '11 at 14:18
0

It depends on what kind of data you are going to search. To achieve really efficient results, you will need to let your program parse the interesting directories, including all the files in them, and keep the data in a database that you can query each time for a specific word in a specific list of files (the list being derived from the search path). A database statement can provide results in milliseconds.

The issue is that you will have to let it run and parse all the files after installation, which may take more than an hour depending on the amount of data you wish to parse.

This database should be updated each time your program starts; this can be done by comparing the MD5 value of each file to see whether it has changed, so you don't have to re-parse all your files every time.

This way of working is interesting if all your data sits in a constant place and you analyse the same files repeatedly rather than completely new files each time. Some code analysers work like this, and they are really efficient: you invest some time in parsing and saving the interesting data, and afterwards you can jump to the exact place where a search word appears and provide a list of all the places it appears in a very short time.

CloudyMarble
0

If the files are to be searched multiple times, it could be a good idea to use a word index.

This is called "Full Text Search".

It will be slower the first time (text must be parsed and indexes must be created), but any future search will be immediate: in short, it will use only the indexes, and not read all text again.
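
As a toy illustration of the principle, an in-memory inverted index (word to list of files) might look like the sketch below; real FTS engines persist the index and add ranking, so this only shows why a lookup becomes immediate once the parsing cost has been paid (the WordIndex and IndexFile names are illustrative assumptions):

  uses
    SysUtils, Classes, Generics.Collections;

  var
    // word (lower case) -> sorted list of files containing it;
    // created once at startup:
    //   WordIndex := TObjectDictionary<string, TStringList>.Create([doOwnsValues]);
    WordIndex: TObjectDictionary<string, TStringList>;

  procedure IndexFile(const FileName: string);
  var
    Lines, Files: TStringList;
    Line, Token: string;
    I, J, Start: Integer;
  begin
    Lines := TStringList.Create;
    try
      Lines.LoadFromFile(FileName);
      for I := 0 to Lines.Count - 1 do
      begin
        Line := LowerCase(Lines[I]);
        J := 1;
        while J <= Length(Line) do
        begin
          // Skip separators, then collect the next run of letters/digits as a word.
          while (J <= Length(Line)) and not CharInSet(Line[J], ['a'..'z', '0'..'9']) do
            Inc(J);
          Start := J;
          while (J <= Length(Line)) and CharInSet(Line[J], ['a'..'z', '0'..'9']) do
            Inc(J);
          Token := Copy(Line, Start, J - Start);
          if Token = '' then
            Continue;
          if not WordIndex.TryGetValue(Token, Files) then
          begin
            Files := TStringList.Create;
            Files.Sorted := True;
            Files.Duplicates := dupIgnore;
            WordIndex.Add(Token, Files);
          end;
          Files.Add(FileName);
        end;
      end;
    finally
      Lines.Free;
    end;
  end;

  // After indexing, a search is a single lookup:
  //   if WordIndex.TryGetValue(LowerCase(TextToFind), Files) then ...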

You have the exact parser you need in The Delphi Magazine Issue 78, February 2002: "Algorithms Alfresco: Ask A Thousand Times", in which Julian Bucknall discusses word indexing and document searches ("if you want to know how Google works its magic this is the page to turn to").

There are several FTS implementations for Delphi.

I'd like to add that most DBs have an embedded FTS engine. SQLite3 even has a very small but efficient implementation, with page ranking and such. We provide direct access from Delphi, with ORM classes, to this Full Text Search engine, named FTS3/FTS4.

Arnaud Bouchez