
I am surprised that this throws an out-of-memory error, considering that the operations are on top of a scala.collection.Iterator. The individual lines are small (< 1 KB).

Source.fromFile("largefile.txt").getLines.map(_.size).max

It appears it is trying to load the entire file into memory, and I am not sure which step triggers this. This is disappointing behavior for such a basic operation. Is there a simple way around it? And is there any reason for this design by the library implementors?

Tried the same in Java 8:

Files.lines(Paths.get("largefile.txt")).map(it -> it.length()).max(Integer::max).get()
//result: 3131

And this works predictably: Files.lines returns a java.util.stream.Stream and the heap does not explode.

Update: it looks like it boils down to newline interpretation. Both files are being interpreted as UTF-8, and down the line both call java.io.BufferedReader.readLine(). So I still need to figure out where the discrepancy is. And I compiled both snippets' Main classes into the same project jar.

smartnut007
  • Lots ... The file size is bigger than the heap size. That's not the point; I expect that not to matter since it's an iterator. – smartnut007 Mar 03 '15 at 02:29
  • 3
    How big a file to reproduce this? I just ran this on 10 GB with no problems. – Ben Reich Mar 03 '15 at 03:43
  • 5
    Are you *sure* each line is less than 1KB? If there aren't any line breaks, then calling `_.size` will build a very large `String` and quickly exhaust the memory. – Michael Zajac Mar 03 '15 at 03:53
  • 2
    There must be a very large line somewhere which causes the exception. This is similar to: http://stackoverflow.com/questions/24334549/outofmemory-error-when-using-apache-commons-lineiterator. Although it's an iterator, it iterates over each line, not each character so it's not very safe. Write your own function that does not load the whole line into memory but rather counts chars between line breaks. – yǝsʞǝla Mar 03 '15 at 09:44
  • Yeah, I did ensure that the size of the line is small. As you can see from the result of the output from the Java8 snippet. Also, both are interpreted as utf-8. – smartnut007 Mar 04 '15 at 19:14
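The last suggestion in the comments above - counting characters between line breaks instead of materializing each line - can be sketched as follows. This uses the fact that Source is itself an Iterator[Char]; the path and the helper name maxLineLength are illustrative, and error handling is kept minimal:

```scala
import scala.io.Source

// Sketch: find the maximum line length without ever building a line as a
// String, by counting characters between separators. A pathological
// multi-gigabyte "line" then only costs a counter, not heap.
def maxLineLength(path: String): Int = {
  val source = Source.fromFile(path) // Source is an Iterator[Char]
  try {
    var max = 0
    var current = 0
    for (c <- source) {
      if (c == '\n' || c == '\r') {  // treat either character as a break;
        if (current > max) max = current // "\r\n" just resets twice, harmlessly
        current = 0
      } else current += 1
    }
    math.max(max, current)           // account for a last line with no newline
  } finally source.close()
}
```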

1 Answer


I'm willing to bet the issue is that you're counting 'lines' differently than getLines is. From the API:

(getLines) Returns an iterator who returns lines (NOT including newline character(s)). It will treat any of \r\n, \r, or \n as a line separator (longest match) - if you need more refined behavior you can subclass Source#LineIterator directly.

Try executing this against the file in question:

  Source.fromFile("testfile.txt").getLines()
    .zipWithIndex
    .foreach { case (line, i) =>
      if (line.length > 1000)
        println("line: " + i + " is: " + line.length + " characters!")
    }

This will tell you which lines in the file are longer than 1000 characters, and the index of each offending line.
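A variant of the same diagnostic, sketched as a self-contained helper (the name lineLengths and the return of (index, length) pairs are illustrative, not part of the original answer): it prints each line's length as it goes, so even if the process dies mid-file you can see how far it got.

```scala
import scala.io.Source

// Sketch: report every line's index and length while iterating, and return
// the pairs so they can also be inspected programmatically.
def lineLengths(path: String): List[(Int, Int)] = {
  val source = Source.fromFile(path)
  try {
    source.getLines().zipWithIndex.map { case (line, i) =>
      println(s"line $i: ${line.length} chars")
      (i, line.length)
    }.toList // force the lazy iterator before the source is closed
  } finally source.close()
}
```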

snerd
  • also - if this blows up before printing anything - you can easily refactor this line to print/log every line number - which will also help you locate the index of the problem. – snerd Mar 03 '15 at 19:20
  • Does not address my question at all. Sorry. – smartnut007 Mar 04 '15 at 02:25
  • 1
    @smartnut007 - yes it surely does. you're operating under the assumption that you haven't made a mistake yet have failed to show how you came to that conclusion. I, Ben Reich, m-z, and Aleksey Izmailov have all stated essentially the same thing -- the issue is likely programmer error. As such, why don't you show us that it's not? What output do you get when you run the above code? What is your response to Ben Reich's comment that he was unable to reproduce your results with a 10GB file? – snerd Mar 04 '15 at 16:23
  • Yeah, I did ensure that the size of the line is small. As you can see from the result of the output from the Java8 snippet. Also, both are interpreted as utf-8 – smartnut007 Mar 04 '15 at 19:14
  • Yes, it does address the question. It appears to boil down to the new line interpretation. When you say "longest match" do you mean like regex greedy match ?. Also, Is there a default way to make the behavior the same as the Java version. – smartnut007 Mar 04 '15 at 19:39
  • w00t - the new info helps! On the assumption that Oracle's JDK8 code for this isn't radically different than what's implemented in OpenJDK, I checked out the source and confirmed that both scala and java are using the same delimiting rules. while a line length of 3131 > 1KB, there's still no reason for the out of heap space issue. question: are you running the scala code in the context of a larger program? if not, what happens when you run the operation from the scala console? – snerd Mar 04 '15 at 19:45
  • oops - forgot about your other comment question - I don't know yet how, if at all, scala differs from java in the code implementation. as for 'longest match', to my read there are multiple combinations of \n or \r that can be counted as line delimiters - so if there is a line that could be read as ending in more than one spot, what gets returned is the line delimited in such a way so as to create the longest line. – snerd Mar 04 '15 at 19:49
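The "longest match" rule from the getLines scaladoc can be checked directly: a "\r\n" pair is consumed as a single separator (the longest possible match) rather than as two line breaks, while a lone "\r" or "\n" each also end a line - the same delimiting rules BufferedReader.readLine uses. The sample string below is illustrative:

```scala
import scala.io.Source

// "a\r\nb\rc\nd" contains all three separator forms; each ends exactly
// one line, so four lines come back rather than five.
val lines = Source.fromString("a\r\nb\rc\nd").getLines().toList
```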