7

Imagine I have a very large text file. Performance really matters.

All I want to do is scan it for a certain string. Maybe I want to count the occurrences, but that's not really the point.

The point is: what's the fastest way?

I don't care about maintainability; it just needs to be fast.

Fast is key.

Joachim Sauer
chacko

8 Answers

16

For a one-off search, use a Scanner, as suggested here:

A simple technique that could well be considerably faster than indexOf() is to use a Scanner, with the method findWithinHorizon(). If you use a constructor that takes a File object, Scanner will internally make a FileChannel to read the file. And for pattern matching it will end up using a Boyer-Moore algorithm for efficient string searching.
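A minimal sketch of the Scanner approach, assuming the file is UTF-8 text and the search string should be matched literally. The class and method names (`ScannerSearch`, `countOccurrences`) are made up for illustration, and this has not been benchmarked against the other suggestions.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.util.Scanner;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of any library.
final class ScannerSearch {

    // Counts non-overlapping occurrences of needle in the file.
    static long countOccurrences(Path file, String needle) throws IOException {
        if (needle.isEmpty()) {
            throw new IllegalArgumentException("needle must not be empty");
        }
        // LITERAL makes the regex engine treat the needle as plain text.
        Pattern literal = Pattern.compile(needle, Pattern.LITERAL);
        long count = 0;
        // Assumption: the file is UTF-8 encoded.
        try (Scanner scanner = new Scanner(file.toFile(), "UTF-8")) {
            // Horizon 0 means "no bound": search to the end of the input.
            // Each hit moves the scanner past the matched text.
            while (scanner.findWithinHorizon(literal, 0) != null) {
                count++;
            }
        }
        return count;
    }
}
```

Because the scanner advances past each match, overlapping hits are not counted: searching "aaa" for "aa" yields 1.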

Joel
4

First of all, use nio (FileChannel) rather than the java.io classes. Second, use an efficient string search algorithm like Boyer-Moore.
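Roughly what combining the two could look like. This sketch maps the file through a FileChannel and searches the raw bytes with Boyer-Moore-Horspool, a simplified Boyer-Moore variant that uses only the bad-character table, not the full algorithm. It assumes the file is smaller than 2 GB (one mapping) and that comparing UTF-8 bytes is good enough. `MappedHorspool` is a made-up name.

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.Arrays;

// Hypothetical helper class, not part of any library.
final class MappedHorspool {

    // Counts occurrences (overlapping ones included) of needle in the file.
    static long count(Path file, String needle) throws IOException {
        byte[] pat = needle.getBytes(StandardCharsets.UTF_8);
        int m = pat.length;
        if (m == 0) {
            throw new IllegalArgumentException("needle must not be empty");
        }
        // Bad-character table: how far the window may slide, keyed by the
        // byte currently aligned with the last pattern position.
        int[] shift = new int[256];
        Arrays.fill(shift, m);
        for (int i = 0; i < m - 1; i++) {
            shift[pat[i] & 0xFF] = m - 1 - i;
        }
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            long size = ch.size();
            // Assumption: a single mapping is enough (limit is 2 GB - 1).
            if (size > Integer.MAX_VALUE) {
                throw new IOException("file too large for a single mapping");
            }
            MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_ONLY, 0, size);
            int n = (int) size;
            long count = 0;
            int pos = 0;
            while (pos <= n - m) {
                // Compare right to left inside the current window.
                int j = m - 1;
                while (j >= 0 && buf.get(pos + j) == pat[j]) {
                    j--;
                }
                if (j < 0) {
                    count++;
                }
                pos += shift[buf.get(pos + m - 1) & 0xFF];
            }
            return count;
        }
    }
}
```

A larger file would need several mappings with an overlap of `m - 1` bytes between them, which is left out here.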

If you need to search through the same file multiple times for different strings, you'll want to construct some kind of index, so take a look at Lucene.

Michael Borgwardt
  • Why nio instead of io? The nio classes are for scaling, not necessarily for speed. – jtahlborn Feb 03 '11 at 14:01
  • @jtahlborn: You're mistaking one aspect (scalable networking via selectors) for the whole. The nio classes can also speed up file operations a lot by avoiding copy operations. For example (and relevant to this question), a MappedByteBuffer can directly use the disk page data as provided by the OS, whereas a BufferedInputStream has to copy it because it's built on top of the InputStream interface. – Michael Borgwardt Feb 03 '11 at 14:26
  • In order to work with the data, you will still need to copy it into the Java heap. So, for operations that are a one-time read through the file, I doubt that this will make any significant difference. – jtahlborn Feb 03 '11 at 14:43
1

Load the whole file into memory and then look at using a string searching algorithm such as Knuth-Morris-Pratt.
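A rough sketch of that idea, assuming the file fits in the heap (`Files.readAllBytes` cannot return more than about 2 GB) and that the needle can be compared as UTF-8 bytes. `KmpSearch` and its methods are illustrative names only.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper class, not part of any library.
final class KmpSearch {

    // fail[i] = length of the longest proper prefix of pat[0..i]
    // that is also a suffix of it.
    static int[] failureTable(byte[] pat) {
        int[] fail = new int[pat.length];
        int k = 0;
        for (int i = 1; i < pat.length; i++) {
            while (k > 0 && pat[i] != pat[k]) {
                k = fail[k - 1];
            }
            if (pat[i] == pat[k]) {
                k++;
            }
            fail[i] = k;
        }
        return fail;
    }

    // Counts occurrences (overlapping ones included) of pat in text.
    static long count(byte[] text, byte[] pat) {
        if (pat.length == 0) {
            throw new IllegalArgumentException("pattern must not be empty");
        }
        int[] fail = failureTable(pat);
        long count = 0;
        int k = 0; // number of pattern bytes currently matched
        for (byte b : text) {
            while (k > 0 && b != pat[k]) {
                k = fail[k - 1];
            }
            if (b == pat[k]) {
                k++;
            }
            if (k == pat.length) {
                count++;
                k = fail[k - 1]; // keep going so overlapping hits are found
            }
        }
        return count;
    }

    // Assumption: the whole file fits in memory.
    static long countInFile(Path file, String needle) throws IOException {
        byte[] text = Files.readAllBytes(file);
        return count(text, needle.getBytes(StandardCharsets.UTF_8));
    }
}
```

The text is scanned once and never re-read, so the search itself is linear in the file size.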

Edit:
A quick google shows this string searching library that seems to have implemented a few different string search algorithms. Note I've never used it so can't vouch for it.

Qwerky
  • Yes, but to load it into memory you have to read it off disk first. Unless you need to do more than one search (the OP doesn't specify), you should just parse the stream. – Richard H Feb 03 '11 at 13:19
0

Whatever the specifics, memory-mapped I/O is usually the answer.

Edit: depending on your requirements, you could try importing the file into an SQL database and then leveraging the performance improvements through JDBC.

Edit2: this thread at JavaRanch has some other ideas, involving FileChannel. I think it might be exactly what you are looking for.

0

I'd say the fastest you can get will be to use BufferedInputStreams on top of FileInputStreams... or use custom buffers if you want to avoid the BufferedInputStream instantiation.

This explains it better than I can: http://java.sun.com/developer/technicalArticles/Programming/PerfTuning/
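The custom-buffer variant could look something like the sketch below. It reads the file in chunks through a plain FileInputStream and carries the last `m - 1` bytes over to the next chunk so a match that straddles two reads is not lost. The in-chunk comparison is deliberately naive, and `BufferedSearch` and the buffer size are assumptions for illustration, not tuned values.

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Path;

// Hypothetical helper class, not part of any library.
final class BufferedSearch {

    // Counts occurrences (overlapping ones included) of pat in the file,
    // reading through a caller-sized buffer instead of BufferedInputStream.
    static long count(Path file, byte[] pat, int bufSize) throws IOException {
        int m = pat.length;
        if (m == 0) {
            throw new IllegalArgumentException("pattern must not be empty");
        }
        // The buffer must hold the carried tail plus at least one new byte.
        byte[] buf = new byte[Math.max(bufSize, 2 * m)];
        int carried = 0;
        long count = 0;
        try (InputStream in = new FileInputStream(file.toFile())) {
            int read;
            while ((read = in.read(buf, carried, buf.length - carried)) != -1) {
                int filled = carried + read;
                // Naive scan of every window that fits entirely in the buffer.
                for (int i = 0; i + m <= filled; i++) {
                    int j = 0;
                    while (j < m && buf[i + j] == pat[j]) {
                        j++;
                    }
                    if (j == m) {
                        count++;
                    }
                }
                // Keep the last m - 1 bytes: a match starting there has not
                // been counted yet because it needs bytes from the next read.
                carried = Math.min(m - 1, filled);
                System.arraycopy(buf, filled - carried, buf, 0, carried);
            }
        }
        return count;
    }
}
```

A call such as `BufferedSearch.count(path, "needle".getBytes(StandardCharsets.UTF_8), 1 << 16)` would use a 64 KB buffer; the best size depends on the disk and OS, so measure before settling on one.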

Kellindil
0

Use the right tool: full text-search library

My suggestion is to build an in-memory index (or a file-based index with caching enabled) and then perform the search on it. As @Michael Borgwardt suggested, Lucene is the best library out there.

Aravind Yarram
0

I don't know if this is a stupid suggestion, but isn't grep a pretty efficient file searching tool? Maybe you can call it using Runtime.getRuntime().exec(..)
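Something along these lines, as a sketch. It uses ProcessBuilder rather than Runtime.getRuntime().exec(..) because the arguments are passed as a list and need no quoting. It assumes a `grep` binary is on the PATH. Note that `-c` counts matching lines, not individual occurrences, and `-F` treats the needle as a fixed string. `GrepCount` is a made-up name.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;

// Hypothetical helper class, not part of any library.
final class GrepCount {

    // Returns the number of lines in the file that contain needle.
    static long countMatchingLines(Path file, String needle)
            throws IOException, InterruptedException {
        // "--" ends option parsing, so a needle starting with '-' is safe.
        Process p = new ProcessBuilder(
                "grep", "-c", "-F", "--", needle, file.toString())
                .redirectErrorStream(true)
                .start();
        String firstLine;
        try (BufferedReader r = new BufferedReader(
                new InputStreamReader(p.getInputStream(), StandardCharsets.UTF_8))) {
            firstLine = r.readLine();
        }
        int exit = p.waitFor();
        // grep exits with 0 on a match, 1 on no match, and >1 on error.
        if (exit > 1 || firstLine == null) {
            throw new IOException("grep failed: " + firstLine);
        }
        return Long.parseLong(firstLine.trim());
    }
}
```

Starting a process has a fixed cost, so this only makes sense for files large enough that the search time dominates.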

Adriaan Koster
0

It depends on whether you need to do more than one search per file. If you need to do just one search, read the file in from disk and parse it using the tools suggested by Michael Borgwardt. If you need to do more than one search, you should probably build an index of the file with a tool like Lucene: read the file in, tokenise it, and put the tokens in the index. If the index is small enough, keep it in RAM (Lucene offers both RAM and disk-backed indexes). If not, keep it on disk. And if it is too large for RAM and you are very, very, very concerned about speed, store your index on a solid-state/flash drive.

Richard H