
I had this piece of code:

while ((line = br.readLine()) != null) {
    String Words[] = line.split(" ");
    outputLine = SomeAlgorithm(Words);
    output.write(outputLine);
}

As you can see in the above code, for every line in the input file I read one line, run an algorithm on it that essentially modifies the line, and then write the output line to another file.

There are 9k lines in the file, and the entire program took 3 minutes on my machine.

I thought, okay, I'm doing 2 I/Os for every run of the algorithm (one read and one write per line), so around 18k I/Os in total. Why not collect all the lines into an ArrayList first, then loop through the list and run the algorithm on each line? I could also collect each output into one string variable and write out all the output once at the end of the program.

That way, I'd have a total of 2 big I/Os for the entire program (18k small file I/Os down to 2 big file I/Os). I thought this would be faster, so I wrote this:

List<String> lines = new ArrayList<String>();
while ((line = br.readLine()) != null) {
    lines.add(line); // collect all lines first
}

for (String line : lines) {
    String Words[] = line.split(" ");
    bigOutput += SomeAlgorithm(Words); // collect all output
}

output.write(bigOutput);

But this thing took 7 minutes!!!

So, why is looping through an ArrayList slower than reading a file line by line?

Note: Collecting all the lines with readLine() and writing bigOutput each take only a few seconds, and no change was made to SomeAlgorithm() either. So I think the culprit must be the for (String line : lines) loop.

Update: As mentioned in the various comments below, the problem was not the ArrayList traversal; it was the way the output was accumulated using +=. Switching to a StringBuilder gave a result even faster than the original.
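
For reference, a minimal sketch of the batched version with the `+=` accumulation replaced by a `StringBuilder`; the file names and the `someAlgorithm` stub below are placeholders, not the actual code from the question:

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class BatchedRewrite {

    // Placeholder for the question's algorithm: takes the words of one line,
    // returns the transformed line.
    static String someAlgorithm(String[] words) {
        return words.length + " words";
    }

    public static void main(String[] args) throws IOException {
        List<String> lines = new ArrayList<String>(10000); // pre-size to avoid regrowth
        try (BufferedReader br = new BufferedReader(new FileReader("input.txt"))) {
            String line;
            while ((line = br.readLine()) != null) {
                lines.add(line); // collect all lines first
            }
        }

        StringBuilder bigOutput = new StringBuilder(); // linear-time accumulation instead of +=
        for (String line : lines) {
            String[] words = line.split(" ");
            bigOutput.append(someAlgorithm(words)).append('\n');
        }

        try (BufferedWriter output = new BufferedWriter(new FileWriter("output.txt"))) {
            output.write(bigOutput.toString()); // one big write at the end
        }
    }
}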

sanjeev mk
  • The [string] `+=` to build the result is likely the problem and *not* the ArrayList. Switch to a StringBuilder, or better, *just use buffered output*. The buffered output and the OS cache will handle this just fine. – user2864740 Aug 10 '14 at 18:35
  • Try using LinkedList? – mrVoid Aug 10 '14 at 18:35
  • @mrVoid That wouldn't help even if the list was the issue (it's probably not, see the first comment). –  Aug 10 '14 at 18:36
  • Try `new ArrayList<>(10000)` in order to avoid excessive number of reallocations inside `ArrayList`. – SJuan76 Aug 10 '14 at 18:37
  • Use a buffered reader and a buffered writer, which will reduce the number of file IO tremendously, while still allowing you to use your basic, simple algorithm. – JB Nizet Aug 10 '14 at 18:37
  • Thanks @user2864740 , the += was the big problem. Back to normal running times now – sanjeev mk Aug 10 '14 at 18:48
  • Take a look at the [`Files`](http://docs.oracle.com/javase/7/docs/api/java/nio/file/Files.html) utility class. It already has methods for reading all lines in a file to a `List` and writing them back to a file. These methods are optimised and tested more than your code could possibly be. – Boris the Spider Aug 10 '14 at 21:49
  • Related: http://stackoverflow.com/a/22244296/616460 – Jason C Aug 11 '14 at 01:10
  • BTW: Java 7 offers a [`Files#readAllLines()`](http://docs.oracle.com/javase/7/docs/api/java/nio/file/Files.html#readAllLines(java.nio.file.Path,%20java.nio.charset.Charset)) method. – eckes Aug 24 '14 at 01:04
  • `So I'm doing around 18k I/Os.` You shouldn't: `br` should be a `BufferedReader`, with highly tuned buffering without loading the whole file in memory at once. – Blaisorblade Aug 24 '14 at 07:45
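
Pulling the buffered-I/O suggestions from the comments above together, here is a sketch that keeps the original line-by-line structure and lets the buffers and the OS cache do the batching (the file names and the `someAlgorithm` stub are again just placeholders):

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class StreamingRewrite {

    // Placeholder for the question's algorithm.
    static String someAlgorithm(String[] words) {
        return words.length + " words";
    }

    public static void main(String[] args) throws IOException {
        try (BufferedReader br = Files.newBufferedReader(Paths.get("input.txt"), StandardCharsets.UTF_8);
             BufferedWriter bw = Files.newBufferedWriter(Paths.get("output.txt"), StandardCharsets.UTF_8)) {
            String line;
            while ((line = br.readLine()) != null) {
                bw.write(someAlgorithm(line.split(" "))); // transform one line...
                bw.newLine();                             // ...and hand it to the writer; the buffer
            }                                             // turns these into a few large writes
        }
        // Java 7's Files.readAllLines(path, charset) is an alternative if the whole
        // file really does need to be in memory at once.
    }
}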

2 Answers


I suspect the difference in performance is due to how you are collecting the output in one variable (bigOutput). My conjecture is that this involves lots of memory reallocations and copying of character data, which is the real cause of the slowness.
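
One way to see this conjecture in numbers is a small hypothetical micro-benchmark along these lines, mimicking the 9k lines of roughly 100 characters from the question (the figures it prints will vary by machine and are not the timings reported above). Each `+=` allocates a new String and copies everything accumulated so far, so the total work grows quadratically, whereas `StringBuilder` only copies when its internal buffer has to grow:

public class ConcatCost {
    public static void main(String[] args) {
        String chunk = new String(new char[100]).replace('\0', 'x'); // one ~100-char "line"

        long t0 = System.nanoTime();
        String plus = "";
        for (int i = 0; i < 9000; i++) {
            plus += chunk; // copies the whole accumulated string each time -> O(n^2) overall
        }
        long t1 = System.nanoTime();

        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 9000; i++) {
            sb.append(chunk); // amortised O(1) per append
        }
        String built = sb.toString();
        long t2 = System.nanoTime();

        System.out.printf("+=            : %d ms (%d chars)%n", (t1 - t0) / 1000000, plus.length());
        System.out.printf("StringBuilder : %d ms (%d chars)%n", (t2 - t1) / 1000000, built.length());
    }
}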

NPE
  • To be precise, the many reallocations and copies are due to *how* the output is accumulated (string `+=`); they are not inherent to accumulating it at all (a `StringBuffer` would be fine). –  Aug 10 '14 at 18:37
  • @delnan Yes, the += was the problem! – sanjeev mk Aug 10 '14 at 18:47

This depends on the size of the file, but likely what is going on here is that it takes longer to resize the ArrayList storage and concatenate strings multiple times than it does to make a lot of small file operations.

Bear in mind that the disk and the OS both perform some level of I/O caching, and some of this involves read-ahead (the expectation being that you are likely to read data sequentially), so the first read is likely stuffing quite a bit of the file into the I/O cache, from which you can read very quickly.

You are therefore trading small reads from the I/O cache for many resizes of flat arrays (the ArrayList and the output string), both of which become slower and slower as they grow.

tl;dr version: Let the various I/O caches do their job.

cdhowie
  • I just tried a test that puts 9000 strings of about 100 characters each into an ArrayList, and it took 15 ms. Your answer can only be completely correct if the asker is running on a potato – Philipp Gayret Aug 10 '14 at 18:39
  • ArrayList resizing shouldn't be a significant factor, especially when it sits next to the O(n²) string additions. –  Aug 10 '14 at 18:40
  • The `ArrayList` growing should not be such a problem, since its size grows exponentially; I think the main problem is the `+=` string concatenation, so replacing that with `StringBuilder` should rapidly improve the algorithm. – kajacx Aug 10 '14 at 18:41
  • Also tested the big string concatenation; this took 3.5 seconds on my machine (9000 lines, 100 chars each), with almost identical code as in the example – Philipp Gayret Aug 10 '14 at 18:41
  • The += was the problem. I removed it and returned to writing output after each run of the algorithm, and it ran in 2 minutes 30 seconds! – sanjeev mk Aug 10 '14 at 18:47
  • @sanjeevmk I would still note that it's a bad idea to read a whole file into memory before processing it, unless your processing is mostly going to be random access (jumping around all over the file). As the file gets bigger you are going to increase memory consumption for no good reason. This can result in other memory pages being swapped out, lowering overall system performance. As a general rule, if you need to process the file sequentially then only read in what you need, and discard it when you are done with it. – cdhowie Aug 10 '14 at 18:49
  • @cdhowie As always, generalizations are bad ;-)) There are many cases where reading an entire file using some low-level bulk I/O operation, followed by a pass over the byte[] (or whatever) constituting the file data, is faster than a line-by-line approach. Of course, doing essentially the same readLine into an ArrayList is no gain compared to doing that in a blow-by-blow loop. – laune Aug 10 '14 at 19:12