Increasing performance in file manipulation

Question

I have a file which contains a matrix of numbers as follows:

0 10 24 10 13 4 101 ...
6 0 52 10 4 5 0 4 ...
3 4 0 86 29 20 77 294 ...
4 1 1 0 78 100 83 199 ...
5 4 9 10 0 58 8 19 ...
6 58 60 13 68 0 148 41 ...
. .
.   .
.     .

I want to sum each row and write the sums to a new file, one sum per line.

I have tried doing it in Haskell using ByteStrings, but it is about 3 times slower than the Python implementation. Here is the Haskell implementation:

import qualified Data.ByteString.Char8 as B

-- This function is for summing a row
sumrows r = foldr (\x y -> (maybe 0 (*1) $ fst <$> (B.readInt x)) + y) 0 (B.split ' ' r)

-- This function is for mapping the sumrows function to each line
sumfile f = map (\x -> (show x) ++ "\n") (map sumrows (B.split '\n' f)) 

main = do
  contents <- B.readFile "telematrix"
  -- I get the sum of each line, and then pack up all the results so that it can be written
  B.writeFile "teleDensity" $ (B.pack . unwords) (sumfile contents)
  print "complete"

This takes about 14 seconds for a 25 MB file.

Here is the Python implementation:

fd = open("telematrix", "r")
nfd = open("teleDensity", "w")

for line in fd: 
  nfd.write(str(sum(map(int, line.split(" ")))) + "\n")

fd.close()
nfd.close()

This takes about 5 seconds for the same 25 MB file.

Any suggestions on how to speed up the Haskell implementation?

Could you link to the file? On my computer your Haskell program runs 4 times faster than the Python one with a random 1000 x 1000 matrix (which is 13 Mb). — András Kovács, Jun 28 '15 at 06:30
Here is the file: https://www.dropbox.com/s/5nhttzeytkxzmwm/telematrix?dl=0 — abden003, Jun 28 '15 at 07:31
I think I might have figured out why it was so slow. I was using runhaskell instead of compiling it with ghc and then running it. — abden003, Jun 28 '15 at 07:49
@abden003 If it turns out it was indeed caused just by `runhaskell`, please consider adding it as your own answer (and accept it later) so that the question is marked as resolved. — Petr, Jun 28 '15 at 09:19
You should use `foldl' (+) 0` to sum integers, not `foldr (+) 0` (you've manually fused this with another loop, but I hope you see what I mean). — Reid Barton, Jun 28 '15 at 15:08
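
The strict left fold suggested in the last comment could look roughly like the sketch below. The name `sumRow` is made up for illustration and is not from the question. Fields that fail to parse count as 0, as they do with the question's `maybe 0`.

```haskell
import qualified Data.ByteString.Char8 as B
import Data.List (foldl')

-- Hypothetical helper (not the asker's code): sum one row with a strict
-- left fold, so the accumulator is evaluated at every step instead of
-- building a chain of deferred additions as foldr does.
sumRow :: B.ByteString -> Int
sumRow = foldl' step 0 . B.split ' '
  where
    -- B.readInt returns Nothing for a field that is not a number;
    -- such fields contribute 0, mirroring the question's `maybe 0`.
    step acc field = acc + maybe 0 fst (B.readInt field)

main :: IO ()
main = print (sumRow (B.pack "6 0 52 10 4 5 0 4"))  -- prints 81
```

With optimization enabled, GHC may well compile both folds to similar code for short rows, so this is a habit worth having rather than a guaranteed speedup.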

score 1 · Answer 1 · answered Jun 29 '15 at 00:30

It seems that the problem was that I was running the program with runhaskell, which interprets it, instead of compiling it with ghc and then running the resulting executable. Compiling first brought the Haskell running time down to about 1 second.

answered Jun 29 '15 at 00:30

abden003


score 0 · Answer 2 · answered Jun 28 '15 at 06:05

At a glance, I would bet your first bottleneck is in the ++ on strings in sumfile, which is destructuring the left operand each time and rebuilding it. Instead of appending "\n" to the end, you could replace the unwords function call with unlines, which does exactly what you want it to here. That should get you a nice little speed boost.

A more minor nitpick is that the (*1) in the maybe function is unneeded. Using id there would be more efficient, since (*1) wastes a multiplication operation, but that's no more than a few processor cycles.

Then finally, I have to ask why you're using ByteStrings here. ByteStrings store string data efficiently as an array, like traditional strings in a more imperative language. However, what you're doing here involves splitting the string and iterating over the elements, which are operations that linked lists are well suited to. I would honestly recommend using the traditional [Char] type in this case. That B.split call may be what's ruining you, since it has to take the entire line and copy it into separate arrays of the split form, whereas the words function for linked lists of characters simply splits the linked structure off at a few points.

`B.split` doesn't copy data, it returns slices. `ByteString` is definitely much faster than lists for the OP's task. Also, the `++` in `sumfile` is a one-off small O(n) cost per line that shouldn't affect the total runtime much. — András Kovács, Jun 28 '15 at 06:38
I'm quite sure using `ByteString` and `B.split` is ok and actually helps performance. The library is smart enough not to copy the arrays; it creates slices pointing to the original piece of data. In particular the documentation for [`split`](https://hackage.haskell.org/package/bytestring-0.10.6.0/docs/Data-ByteString-Char8.html#v:split) says: "As for all splitting functions in this library, this function does not copy the substrings, it just constructs new ByteStrings that are slices of the original." — Petr, Jun 28 '15 at 08:09
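
Putting the answer's `unlines` idea together with the comments' point about staying in `ByteString`, a minimal sketch might keep the whole pipeline in `Data.ByteString.Char8` and avoid the intermediate `String` list. The names `sumRow` and `sumFile` are illustrative and not from the thread. `B.lines` and `B.words` stand in for the question's `B.split` calls, so a trailing newline does not produce an extra empty row.

```haskell
import qualified Data.ByteString.Char8 as B
import Data.List (foldl')

-- Hypothetical helper: sum the whitespace-separated integers of one row.
-- Fields that do not parse as an Int contribute 0.
sumRow :: B.ByteString -> Int
sumRow = foldl' (\acc w -> acc + maybe 0 fst (B.readInt w)) 0 . B.words

-- Hypothetical helper: one output line per input row. B.unlines adds
-- the trailing newline after every sum, so no (++ "\n") is needed.
sumFile :: B.ByteString -> B.ByteString
sumFile = B.unlines . map (B.pack . show . sumRow) . B.lines

-- Reads the matrix from standard input and writes the sums to standard output.
main :: IO ()
main = B.interact sumFile
```

Once compiled, this could be run as `./program < telematrix > teleDensity`, assuming the executable is named `program` as in the accepted answer.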

score 0 · Accepted Answer · answered Jul 21 '15 at 20:29

The main reason for the poor performance was that I was using runhaskell instead of first compiling the program and then running it. So I switched from:

runhaskell program.hs

to

ghc program.hs

./program

answered Jul 21 '15 at 20:29

abden003

