What's the fastest way to scan a very large file in Java?

Question

Imagine I have a very large text file. Performance really matters.

All I want to do is scan it for a certain string. Maybe I want to count how many occurrences there are, but that is not really the point.

The point is: what's the fastest way?

I don't care about maintainability; it needs to be fast.

Fast is key.

More importantly: does it need to be fast once or do you need to search the same source multiple times (for different Strings obviously)? — Joachim Sauer, Feb 03 '11 at 12:45

Joel · Accepted Answer · 2011-02-03T13:41:26.080

16

For a one-off search, use a Scanner, as suggested here:

A simple technique that could well be considerably faster than indexOf() is to use a Scanner, with the method findWithinHorizon(). If you use a constructor that takes a File object, Scanner will internally make a FileChannel to read the file. And for pattern matching it will end up using a Boyer-Moore algorithm for efficient string searching.
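As a rough sketch of the Scanner approach described above: the class name `ScannerSearch`, the method `countOccurrences`, and the ISO-8859-1 decoding (one byte per char, which only suits single-byte text) are illustrative assumptions, not part of the quoted suggestion. Whether the regex engine actually applies a Boyer-Moore-style optimisation to a given pattern is a JDK implementation detail that the sketch does not rely on.

```java
import java.io.File;
import java.io.IOException;
import java.util.Scanner;
import java.util.regex.Pattern;

final class ScannerSearch {
    /** Counts non-overlapping occurrences of a literal string in a file. */
    static long countOccurrences(File file, String needle) throws IOException {
        if (needle.isEmpty()) {
            throw new IllegalArgumentException("needle must not be empty");
        }
        // Pattern.quote treats the needle as a literal, not as a regex.
        Pattern literal = Pattern.compile(Pattern.quote(needle));
        long count = 0;
        // Assumption: the file can be decoded one byte per char.
        try (Scanner scanner = new Scanner(file, "ISO-8859-1")) {
            // A horizon of 0 means "unbounded": search up to the end of input.
            while (scanner.findWithinHorizon(literal, 0) != null) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: ScannerSearch <file> <needle>");
            return;
        }
        System.out.println(countOccurrences(new File(args[0]), args[1]));
    }
}
```

With an unbounded horizon the Scanner keeps buffering input until it finds the next match, so a file with no match (or very sparse matches) can end up largely held in memory. Passing a positive horizon bounds that, at the cost of having to handle matches that fall outside it.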

edited Feb 03 '11 at 13:41

answered Feb 03 '11 at 13:35

Joel


Nice shortcut to get everything I suggested without implementing it manually! – Michael Borgwardt Feb 03 '11 at 14:17
note: this may end up loading the whole file into memory – Aarjav Oct 16 '17 at 21:46

score 4 · Answer 2 · answered Feb 03 '11 at 12:44


First of all, use nio (FileChannel) rather than the java.io classes. Second, use an efficient string search algorithm like Boyer-Moore.

If you need to search through the same file multiple times for different strings, you'll want to construct some kind of index, so take a look at Lucene.
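A minimal sketch of the first suggestion, assuming the search is done on raw bytes: it reads the file in chunks through a `FileChannel` and uses Boyer-Moore-Horspool, a simplified Boyer-Moore variant that keeps only the bad-character table. The class name `ChannelHorspool`, the 64 KiB buffer size and the byte-level matching are illustrative choices, not something this answer prescribes.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;
import java.util.Arrays;

final class ChannelHorspool {
    /** Counts all occurrences (overlapping ones included) of needle in file. */
    static long count(Path file, byte[] needle) throws IOException {
        int m = needle.length;
        if (m == 0) throw new IllegalArgumentException("needle must not be empty");

        // Bad-character table: how far the window may slide, keyed by the
        // byte currently aligned with the last needle position.
        int[] shift = new int[256];
        Arrays.fill(shift, m);
        for (int i = 0; i < m - 1; i++) shift[needle[i] & 0xFF] = m - 1 - i;

        byte[] buf = new byte[Math.max(1 << 16, 2 * m)];
        ByteBuffer view = ByteBuffer.wrap(buf);
        long count = 0;
        int filled = 0; // number of valid bytes currently in buf
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            while (true) {
                view.clear();
                view.position(filled);      // append after the carried-over tail
                int n = ch.read(view);
                if (n < 0) break;           // end of file
                filled += n;
                int pos = 0;
                while (pos + m <= filled) {
                    int j = m - 1;          // compare right to left
                    while (j >= 0 && buf[pos + j] == needle[j]) j--;
                    if (j < 0) count++;
                    pos += shift[buf[pos + m - 1] & 0xFF];
                }
                // Carry the unsearched tail (fewer than m bytes) to the front
                // so a match straddling two chunks is not missed.
                int tail = filled - pos;
                System.arraycopy(buf, pos, buf, 0, tail);
                filled = tail;
            }
        }
        return count;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: ChannelHorspool <file> <needle>");
            return;
        }
        byte[] needle = args[1].getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(count(Paths.get(args[0]), needle));
    }
}
```

Searching bytes rather than decoded chars avoids a decoding pass, but it only finds the needle if it is encoded the same way as the file. The longer the needle, the larger the average skip and the more the algorithm pays off.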

Michael Borgwardt

Why nio instead of io? The nio classes are for scaling, not necessarily for speed. – jtahlborn Feb 03 '11 at 14:01
@jtahlborn: You're mistaking one aspect (scalable networking via selectors) for the whole. The nio classes can also speed up file operations a lot by avoiding copy operations. For example (and relevant to this question), a MappedByteBuffer can directly use the disk page data as provided by the OS, whereas a BufferedInputStream has to copy it because it's built on top of the InputStream interface. – Michael Borgwardt Feb 03 '11 at 14:26
In order to work with the data, you will still need to copy it into the Java heap. So, for operations which are a one-time read through the file, I doubt that this will make any significant difference. – jtahlborn Feb 03 '11 at 14:43

score 1 · Answer 3 · answered Feb 03 '11 at 12:43


Load the whole file into memory and then look at using a string searching algorithm such as Knuth-Morris-Pratt.

Edit:
A quick google shows this string searching library that seems to have implemented a few different string search algorithms. Note I've never used it so can't vouch for it.
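For illustration, a hand-rolled Knuth-Morris-Pratt search over a file loaded with `Files.readAllBytes` could look roughly like this. It is a sketch, not the library mentioned above; the class name `KmpSearch` and the byte-level matching are assumptions, and loading into a single array only works for files that fit in one Java array (under 2 GiB) and in the heap.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

final class KmpSearch {
    /** fail[i] = length of the longest proper prefix of p[0..i] that is also its suffix. */
    static int[] failureTable(byte[] p) {
        int[] fail = new int[p.length];
        int k = 0;
        for (int i = 1; i < p.length; i++) {
            while (k > 0 && p[i] != p[k]) k = fail[k - 1];
            if (p[i] == p[k]) k++;
            fail[i] = k;
        }
        return fail;
    }

    /** Counts all occurrences (overlapping ones included) of pattern in text. */
    static long count(byte[] text, byte[] pattern) {
        if (pattern.length == 0) throw new IllegalArgumentException("pattern must not be empty");
        int[] fail = failureTable(pattern);
        long count = 0;
        int k = 0; // number of pattern bytes matched so far
        for (byte b : text) {
            while (k > 0 && b != pattern[k]) k = fail[k - 1];
            if (b == pattern[k]) k++;
            if (k == pattern.length) {
                count++;
                k = fail[k - 1]; // keep going so overlapping matches are found
            }
        }
        return count;
    }

    /** Loads the whole file into memory, then searches it. */
    static long count(Path file, byte[] pattern) throws IOException {
        return count(Files.readAllBytes(file), pattern);
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: KmpSearch <file> <pattern>");
            return;
        }
        System.out.println(count(Paths.get(args[0]), args[1].getBytes(StandardCharsets.ISO_8859_1)));
    }
}
```

KMP reads every text byte exactly once and never backs up in the text, so the same matching loop would also work on a stream without loading the file first.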

Qwerky

Yes, but to load it into memory you've got to read it off disk first. Unless you need to do more than one search (maybe, the OP doesn't specify), you should just parse the stream. – Richard H Feb 03 '11 at 13:19

score 0 · Answer 4 · answered Feb 03 '11 at 12:31


Whatever the specifics, memory-mapped I/O is usually the answer.

Edit: depending on your requirements, you could try importing the file into an SQL database and then leveraging the performance improvements through JDBC.

Edit2: this thread at JavaRanch has some other ideas, involving FileChannel. I think it might be exactly what you are searching for.
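To make the memory-mapped idea concrete, here is a minimal sketch using `FileChannel.map`. The class name `MappedSearch`, the 1 GiB window and the deliberately naive byte comparison are assumptions for illustration. A single mapping cannot exceed `Integer.MAX_VALUE` bytes, so the file is mapped window by window, and each window is extended by `needle.length - 1` bytes so that matches straddling a window boundary are still seen.

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

final class MappedSearch {
    /** Counts all occurrences (overlapping ones included), mapping 1 GiB at a time. */
    static long count(Path file, byte[] needle) throws IOException {
        return count(file, needle, 1L << 30);
    }

    /** Same search with an explicit window size, which is mainly useful for testing. */
    static long count(Path file, byte[] needle, long window) throws IOException {
        int m = needle.length;
        if (m == 0) throw new IllegalArgumentException("needle must not be empty");
        if (window < 1 || window + m - 1 > Integer.MAX_VALUE) {
            throw new IllegalArgumentException("window size out of range");
        }
        long count = 0;
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            long size = ch.size();
            for (long start = 0; start < size; start += window) {
                // Overlap into the next window by m - 1 bytes.
                long len = Math.min(size - start, window + m - 1);
                MappedByteBuffer map = ch.map(FileChannel.MapMode.READ_ONLY, start, len);
                // Candidate match starts belong to this window only, so every
                // file position is tested exactly once.
                int lastStart = (int) Math.min(len - m, window - 1);
                for (int i = 0; i <= lastStart; i++) {
                    int j = 0;
                    while (j < m && map.get(i + j) == needle[j]) j++;
                    if (j == m) count++;
                }
            }
        }
        return count;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: MappedSearch <file> <needle>");
            return;
        }
        System.out.println(count(Paths.get(args[0]), args[1].getBytes(StandardCharsets.ISO_8859_1)));
    }
}
```

The inner loop is the simplest possible matcher; whether mapping actually beats plain buffered reads for a single sequential pass depends on the OS and hardware, so it is worth measuring on the real file.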

Please treat your mods well.

How the hell could JDBC possibly help in any way? What "performance improvements" are you talking about? – Michael Borgwardt Feb 03 '11 at 12:41

score 0 · Answer 5 · answered Feb 03 '11 at 12:38

I'd say the fastest you can get will be to use BufferedInputStreams on top of FileInputStreams... or use custom buffers if you want to avoid the BufferedInputStream instantiation.

This explains it better than I can: http://java.sun.com/developer/technicalArticles/Programming/PerfTuning/

score 0 · Answer 6 · answered Feb 03 '11 at 12:55


Use the right tool: a full-text search library

My suggestion is to build an in-memory index (or a file-based index with caching enabled) and then perform the search on it. As @Michael Borgwardt suggested, Lucene is the best library out there.

Aravind Yarram

score 0 · Answer 7 · answered Feb 03 '11 at 13:09


I don't know if this is a stupid suggestion, but isn't grep a pretty efficient file searching tool? Maybe you can call it using Runtime.getRuntime().exec(..)
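A sketch of what shelling out to grep might look like, using `ProcessBuilder` (the newer counterpart of `Runtime.exec`). It assumes a POSIX-style `grep` is on the PATH; the class name `GrepCount` is made up for the example. Note that `grep -c` counts matching lines, not individual occurrences, and `-F` makes grep treat the needle as a fixed string rather than a regex.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

final class GrepCount {
    /** Returns the number of lines in file that contain needle, as reported by grep. */
    static long countMatchingLines(String file, String needle)
            throws IOException, InterruptedException {
        // -c: print a count of matching lines; -F: fixed-string match;
        // -e: the next argument is the pattern; --: end of options.
        ProcessBuilder pb = new ProcessBuilder("grep", "-c", "-F", "-e", needle, "--", file);
        pb.redirectErrorStream(true); // merge stderr so the child cannot block on it
        Process process = pb.start();
        String firstLine;
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(process.getInputStream(), StandardCharsets.UTF_8))) {
            firstLine = reader.readLine();
            while (reader.readLine() != null) {
                // drain any remaining output
            }
        }
        int exit = process.waitFor();
        // grep exits with 0 if something matched, 1 if nothing matched, >1 on error.
        if (exit > 1 || firstLine == null) {
            throw new IOException("grep failed (exit " + exit + "): " + firstLine);
        }
        return Long.parseLong(firstLine.trim());
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 2) {
            System.err.println("usage: GrepCount <file> <needle>");
            return;
        }
        System.out.println(countMatchingLines(args[0], args[1]));
    }
}
```

Passing the arguments as a list avoids shell quoting problems with the needle. The obvious trade-off is portability: this only works where a compatible grep is installed.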

Adriaan Koster

score 0 · Answer 8 · answered Feb 03 '11 at 13:25

It depends on whether you need to do more than one search per file. If you need to do just one search, read the file in from disk and parse it using the tools suggested by Michael Borgwardt. If you need to do more than one search, you should probably build an index of the file with a tool like Lucene: read the file in, tokenise it, and put the tokens in the index. If the index is small enough, keep it in RAM (Lucene gives you the option of a RAM or disk-backed index). If not, keep it on disk. And if it is too large for RAM and you are very, very, very concerned about speed, store your index on a solid-state/flash drive.
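To illustrate the "read, tokenise, index" steps without pulling in Lucene, here is a toy in-memory inverted index in plain Java. It is a stand-in for the idea only, not Lucene's API: the class `TinyIndex` and its methods are hypothetical, tokens are lower-cased runs of ASCII word characters, and the file is assumed to be valid UTF-8.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

final class TinyIndex {
    // token -> ascending list of 1-based line numbers containing that token
    private final Map<String, List<Integer>> postings = new HashMap<>();

    /** Reads the file once and indexes every line. */
    static TinyIndex build(Path file) throws IOException {
        TinyIndex index = new TinyIndex();
        try (BufferedReader reader = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
            String line;
            int lineNo = 0;
            while ((line = reader.readLine()) != null) {
                index.addLine(++lineNo, line);
            }
        }
        return index;
    }

    /** Tokenises one line and records its line number for each token. */
    void addLine(int lineNo, String line) {
        for (String token : line.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (token.isEmpty()) continue;
            List<Integer> lines = postings.computeIfAbsent(token, k -> new ArrayList<>());
            // Record each line at most once per token.
            if (lines.isEmpty() || lines.get(lines.size() - 1) != lineNo) {
                lines.add(lineNo);
            }
        }
    }

    /** Looks up a whole word; costs one hash lookup however large the file was. */
    List<Integer> linesContaining(String word) {
        return postings.getOrDefault(word.toLowerCase(Locale.ROOT), Collections.emptyList());
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 2) {
            System.err.println("usage: TinyIndex <file> <word>...");
            return;
        }
        TinyIndex index = build(Paths.get(args[0]));
        for (int i = 1; i < args.length; i++) {
            System.out.println(args[i] + " -> " + index.linesContaining(args[i]));
        }
    }
}
```

Building the index costs one full pass over the file, so it only pays off when several searches follow. It also answers whole-word queries only, not arbitrary substrings; a real library such as Lucene handles analysis, on-disk storage and richer queries.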
