
Modern CPUs are optimised so that repeated access to the same place in memory (temporal locality) and access to consecutive places in memory (spatial locality) are extremely fast operations.

Now, since Haskell is a purely functional language with immutable data, you naturally can't overwrite existing memory, potentially making something like `foldl` much slower than a C `for` loop that repeatedly updates a single result variable.

Does Haskell do anything internally to mitigate this performance loss? And in general, what are its properties regarding locality?
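For concreteness, this is the kind of fold I have in mind (conceptually, each step produces a new accumulator value rather than overwriting a memory cell):

```haskell
-- Summing with a left fold: no result variable is ever overwritten.
total :: Int
total = foldl (+) 0 [1 .. 1000000]

main :: IO ()
main = print total
```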

Electric Coffee
  • Of course Haskell does not specify this, so it will depend on the implementation (most likely GHC), and I think it will be *smart* enough to compile something like `foldl` into a loop (if not GHC itself, maybe even the backend will manage), but I am really just guessing. *Of course* you can always try it for yourself and have a look at the output ;) – Random Dev Apr 24 '15 at 09:41
  • Reads still benefit from locality. Mutable arrays, in suitable monads, should have performance equivalent to imperative languages. Immutable data structures of course do not allow simple in-place modification. In some cases GHC may optimize this (e.g. tight numeric loops do not allocate new integers at every iteration). OTOH, having immutability greatly helps in parallelizing your code without frequent cache invalidation. – chi Apr 24 '15 at 09:47

2 Answers


The general rule is that for "vanilla" Haskell programming you get very little (if any) control over memory layout and memory locality.

However, there do exist a number of more advanced features that allow such control, and libraries that expose friendly abstractions on top of these. The vector library is probably the most popular of the latter. This library provides several fixed-size array types, two of which (Data.Vector.Unboxed and Data.Vector.Storable) give you data locality by representing vectors and their contents as contiguous memory arrays. Data.Vector.Unboxed even contains a simple automatic "structure of arrays" transformation—an unboxed vector of pairs will be represented as a pair of unboxed vectors, one for each of the pairs' components.
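For example, a minimal sketch of that unboxed representation (assuming the vector package; the names points and sumXs are just illustrative):

```haskell
import qualified Data.Vector.Unboxed as U

-- An unboxed vector of pairs is stored as two contiguous arrays, one
-- per component: the "structure of arrays" layout described above.
points :: U.Vector (Double, Double)
points = U.fromList [(0, 0), (1, 2), (3, 4)]

-- Summing one component scans a single contiguous array of Doubles.
sumXs :: Double
sumXs = U.sum (U.map fst points)
```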

Another example is the JuicyPixels library for image processing, which represents images in memory as contiguous bitmaps. This actually bottoms out to Data.Vector.Storable, which exploits a standard facility (Foreign.Storable) for translating user-defined Haskell data types to and from raw bytes.
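As a rough illustration of that Storable side (a sketch, not JuicyPixels' actual internals):

```haskell
import qualified Data.Vector.Storable as S
import Data.Word (Word8)

-- One contiguous buffer of raw bytes, the kind of representation an
-- image library can hand straight to foreign code.
blankBitmap :: Int -> Int -> S.Vector Word8
blankBitmap width height = S.replicate (width * height) 0
```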

But the general pattern is this: in Haskell, when you're interested in memory locality, you identify which data needs to benefit from it and bundle it together in a custom data type whose implementation is designed to provide locality and performance guarantees. Writing such a data type is an advanced undertaking, but most of the legwork has been done already in a reusable fashion (note for example that JuicyPixels mostly just reuses vector).

Note also that:

  1. vector provides stream fusion optimizations that eliminate intermediate arrays when you apply nested vector transformations. If you generate a vector from 0 to 1,000,000, filter out the even numbers, map the (^2) function over the rest, and sum the elements of the result, no array is ever allocated: the library has the smarts to rewrite that into an accumulator loop from 0 to 1,000,000. So a foldl of a vector isn't necessarily slower than a for loop; there might be no array at all! (See the sketch after this list.)
  2. vector provides mutable arrays as well. More generally, in Haskell you can overwrite existing memory if you really insist. It's just (a) not the default paradigm in the language, and therefore (b) a little bit clunky, but absolutely tractable if you just need it in a few performance-sensitive spots.
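To make point 1 concrete, here is a minimal sketch of the fused pipeline described above (assuming Data.Vector.Unboxed):

```haskell
import qualified Data.Vector.Unboxed as U

-- With fusion, this whole pipeline compiles to a single accumulator
-- loop; no intermediate vector is ever allocated.
fusedSum :: Int
fusedSum = U.sum                    -- consume the stream
         . U.map (^ (2 :: Int))     -- square each element
         . U.filter odd             -- filter out the even numbers
         $ U.enumFromTo 0 1000000   -- generate 0..1,000,000
```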

So most of the time, the answer to "I want memory locality" is "use vector."
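And when you do need the in-place mutation from point 2, a minimal sketch using vector's mutable interface looks like this (the names are just illustrative):

```haskell
import Control.Monad.ST (runST)
import qualified Data.Vector.Unboxed as U
import qualified Data.Vector.Unboxed.Mutable as M

-- Fill a buffer by overwriting it in place inside ST, then freeze it
-- back into an ordinary immutable vector.
squares :: Int -> U.Vector Int
squares n = runST $ do
  v <- M.new n
  mapM_ (\i -> M.write v i (i * i)) [0 .. n - 1]
  U.freeze v
```

The mutation is safely encapsulated: runST's type guarantees the buffer can't escape while it is still writable.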

Luis Casillas
  • In regards to your stream fusion, it would still need to store the data in memory somehow; you can't really modify something that isn't there. – Electric Coffee Apr 24 '15 at 21:14
  • The carefully-selected operations in the stream fusion example need no more than one element of the "vector" at a time, and all of the data will live in registers or possibly on the stack. Of course, you could have just written an explicit loop yourself; stream fusion is not magic. – Reid Barton Apr 24 '15 at 22:36
  • By the way, the array package that comes with GHC also includes unboxed and storable arrays. – Reid Barton Apr 24 '15 at 22:37
  • @ElectricCoffee: Not in the example I gave. One rule of thumb for stream fusion is this: if you sequentially generate a vector, that vector has only one consumer, and that consumer visits its elements sequentially, then no array will be allocated. The example I gave fits these conditions from start to end. – Luis Casillas Apr 24 '15 at 23:29

Haskell is an extremely high-level language, and you're asking a question about an extremely low-level detail.

Overall, Haskell's performance is probably similar to any garbage-collected language like Java or C#. In particular, Haskell has mutable arrays, which will have performance similar to any other array. (You may need unboxed arrays to match C performance.)
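For example, a minimal sketch with the array package that ships with GHC (the array contents here are just illustrative):

```haskell
import Data.Array.IO (IOUArray, newArray, readArray, writeArray)

-- An unboxed mutable array of machine Ints: one contiguous block of
-- memory, updated in place like a C array.
main :: IO ()
main = do
  arr <- newArray (0, 9) 0 :: IO (IOUArray Int Int)
  mapM_ (\i -> writeArray arr i (i * i)) [0 .. 9]
  readArray arr 5 >>= print  -- prints 25
```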

For something like a fold, if the final result is something like a machine integer, that probably ends up in a processor register for the entire duration of the loop. So the final machine code is pretty much identical to “a continuously-accessed variable in C”. (If the result is a dictionary or something, then probably not. But that's the same as C as well.)
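Concretely, a strict left fold over machine integers (a sketch; foldl' is the strict fold from Data.List) will typically compile at -O2 to a loop whose accumulator lives in a register:

```haskell
import Data.List (foldl')

-- GHC typically unboxes the Int accumulator here, so each iteration
-- is plain register arithmetic, much like the C loop variable.
sumSquares :: Int -> Int
sumSquares n = foldl' (\acc i -> acc + i * i) 0 [1 .. n]
```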

More generally, if locality is something that matters to you, any garbage-collected language probably isn't your friend. But, again, you can use unboxed arrays to work around that.

All this talk is great, but if you really want to know how fast a specific Haskell program is, benchmark it. It turns out that well-written Haskell programs are usually quite fast. (Just like most compiled languages.)

Added: You can ask GHC to output partially-compiled code in Core format, which is lower-level than Haskell but higher-level than machine code. This lets you see what the compiler has decided to do (in particular, where stuff has been inlined, where abstractions have been removed, etc.). This can help you find out what the final code looks like, without having to go all the way down to machine code.
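For instance, something like this dumps the simplified Core for a module (Main.hs is a placeholder; -dsuppress-all trims the output to make it readable):

```
ghc -O2 -ddump-simpl -dsuppress-all Main.hs
```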

MathematicalOrchid
    "any garbage-collected language probably isn't your friend" Depends. The opposite can be true. Objects allocated close in time end up close in memory. Very nice locality. – usr Apr 24 '15 at 10:23
  • @usr Also depends on the size of the generation-1 heap, the GC frequency, the generation promotion time, and a bunch of other stuff that varies from program to program, yes. :-} – MathematicalOrchid Apr 24 '15 at 10:24
  • The GC tends to preserve locality. It removes dead holes and shoves live objects closer together. See http://stackoverflow.com/questions/14023988/why-is-processing-a-sorted-array-slower-than-an-unsorted-array/14024191#14024191 for an example of locality under GC. – usr Apr 24 '15 at 10:28
  • Agree with usr here. The main alternative of GC is heap allocation of stable pointers (Rust, C++), which is afflicted by fragmentation and slow allocation of small objects, necessitating custom allocators ("arenas") in some cases. Copying GC mostly helps locality per se, although it creates some space overhead (info fields, etc.) which crowds out some useful data from cache. I think overall it helps more than it hinders. – András Kovács Apr 24 '15 at 10:33
  • @ElectricCoffee GHC used to compile to C. I gather that code path is now only used for porting; by default it compiles through the native backend, or it compiles to LLVM if you select that. (I might be wrong tho...) – MathematicalOrchid Apr 24 '15 at 10:56
  • @MathematicalOrchid Might want to add a sentence or a paragraph devoted to reading the Core as well. That usually gives a very good idea as to which constructs got optimised down to tight loops and which ones didn't. – kqr Apr 24 '15 at 11:11