
How can I speed up the following scalaz-stream code? Currently it takes about 5 minutes to process 70 MB of text, so I am probably doing something quite wrong, since a plain Scala equivalent would take a few seconds.

(follow-up to another question)

  val converter2: Task[Unit] = {
    val docSep = "~~~"
    io.linesR("myInput.txt")
      .flatMap { line =>
        val words = line.split(" ")
        // re-emit a separator line as two elements: the separator itself
        // and the rest of the line (the document header)
        if (words.length == 0 || words(0) != docSep) Process(line)
        else Process(docSep, words.tail.mkString(" "))
      }
      .split(_ == docSep)     // group lines into one Vector per document
      .filter(_ != Vector())  // drop empty documents
      .map(lines => lines.head + ": " + lines.tail.mkString(" "))
      .intersperse("\n")
      .pipe(text.utf8Encode)
      .to(io.fileChunkW("correctButSlowOutput.txt"))
      .run
  }
mitchus
  • Just a wild guess here, but maybe `io.linesR` and the `.to(io.fileChunkW...)` parts aren't using buffered streams? – Dylan May 14 '15 at 17:06
  • I'm not sure about this case in particular, but Scalaz will tend to do a lot of generic operations on characters, which results in every character being boxed, which really slows things down. Have you checked what happens if you split it up into pre-pipe and post-pipe operations (i.e. run the first half and store it in a buffer, then output the second half)? – Rex Kerr May 14 '15 at 18:32

2 Answers


I think you could just use one of the `process1` chunk methods to chunk. If you want a lot of parallel processing when merging the lines into your output format, decide whether ordered output is important and use a channel combined with a merge or tee; that will also make the code reusable. Because you are doing a very small amount of processing per element, you are probably swamped by overhead, so you have to work harder to make your unit of work large enough not to be swamped.
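
For what it's worth, here is a rough sketch of what that could look like (untested; it assumes scalaz-stream 0.7's `merge.mergeN` and reuses the document-splitting logic from the question; `formatDoc` is a hypothetical helper, just the per-document formatting pulled out so each document becomes its own `Task`). Note that `mergeN` does not preserve document order:

  import scalaz.concurrent.Task
  import scalaz.stream._

  // Hypothetical helper: format one document's lines into one output line.
  def formatDoc(lines: Vector[String]): String =
    lines.head + ": " + lines.tail.mkString(" ")

  val parallelConverter: Task[Unit] = {
    val docSep = "~~~"
    val docs: Process[Task, Vector[String]] =
      io.linesR("myInput.txt")
        .flatMap { line =>
          val words = line.split(" ")
          if (words.length == 0 || words(0) != docSep) Process(line)
          else Process(docSep, words.tail.mkString(" "))
        }
        .split(_ == docSep)     // one Vector[String] per document
        .filter(_ != Vector())  // drop empty documents

    // Format up to 4 documents concurrently; a whole document is a large
    // enough unit of work that the per-Task overhead should matter less.
    merge.mergeN(4)(docs.map(doc => Process.eval(Task.delay(formatDoc(doc)))))
      .intersperse("\n")
      .pipe(text.utf8Encode)
      .to(io.fileChunkW("parallelOutput.txt"))
      .run
  }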

user1763729

The following is based on @user1763729's suggestion of chunking. It feels clunky, though, and it's just as slow as the original version.

  val converter: Task[Unit] = {
    val docSep = "~~~"
    io.linesR("myInput.txt")
      .intersperse("\n") // handle empty documents (chunkBy has to switch from true to false)
      .zipWithPrevious   // chunkBy cuts only *after* the predicate turns false
      .chunkBy {
        case (Some(prev), line) =>
          val words = line.split(" ")
          words.length == 0 || words(0) != docSep
        case (None, line) => true
      }
      .map(_.map(_._1.getOrElse(""))) // recover the previous element from each pair
      .map(_.filter(!Set("", "\n").contains(_)))
      .map(lines => lines.head.split(" ").tail.mkString(" ") + ": " + lines.tail.mkString(" "))
      .intersperse("\n")
      .pipe(text.utf8Encode)
      .to(io.fileChunkW("stillSlowOutput.txt"))
      .run
  }

EDIT:

Actually, just doing the following (reading the file line by line and re-encoding it, with no further processing and no writing) already takes 1.5 minutes, so I guess there's not much hope of speeding this up.

  val converter: Task[Unit] = {
    io.linesR("myInput.txt")
      .pipe(text.utf8Encode)
      .run
  }
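
One thing that might still be worth trying (an untested sketch; it assumes scalaz-stream's `io.fileChunkR`, `text.utf8Decode` and `text.lines` behave as documented) is to read the file in large byte chunks instead of line by line. `io.linesR` goes through `scala.io.Source` and crosses the effect layer once per line, whereas a chunked read crosses it once per buffer, with decoding and line-splitting done in the pure `process1` layer:

  val chunkedReader: Task[Unit] = {
    Process.constant(65536)                        // requested bytes per read
      .through(io.fileChunkR("myInput.txt", 65536))
      .pipe(text.utf8Decode)                       // ByteVector -> String
      .pipe(text.lines())                          // re-split on newlines
      .run
  }

If this alone runs substantially faster than the `linesR` version, the per-line overhead is the bottleneck and the rest of the pipeline could be rebuilt on top of it.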
mitchus