
I have the following program in Haskell:

processDate :: String -> IO ()
processDate date = do
    ...
    let newFlattenedPropertiesWithPrice = filter (notYetInserted date existingProperties) flattenedPropertiesWithPrice
    geocodedProperties <- propertiesWithGeocoding newFlattenedPropertiesWithPrice

propertiesWithGeocoding :: [ParsedProperty] -> IO [(ParsedProperty, Maybe LatLng)]
propertiesWithGeocoding properties = do
    let addresses = fmap location properties
    let batchAddresses = chunksOf 100 addresses
    batchGeocodedLocations <- mapM geocodeAddresses batchAddresses
    let geocodedLocations = fromJust $ concat <$> sequence batchGeocodedLocations
    return (zip properties geocodedLocations)

geocodeAddresses :: [String] -> IO (Maybe [Maybe LatLng])
geocodeAddresses addresses = do
    mapQuestKey <- getEnv "MAP_QUEST_KEY"
    geocodeResponse <- openURL $ mapQuestUrl mapQuestKey addresses
    return $ geocodeResponseToResults geocodeResponse

geocodeResponseToResults :: String -> Maybe [Maybe LatLng]
geocodeResponseToResults inputResponse =
    latLangs
    where
        decodedResponse :: Maybe GeocodingResponse
        decodedResponse = decodeGeocodingResponse inputResponse

        latLangs = fmap (fmap geocodingResultToLatLng . results) decodedResponse

decodeGeocodingResponse :: String -> Maybe GeocodingResponse
decodeGeocodingResponse inputResponse = Data.Aeson.decode (fromString inputResponse) :: Maybe GeocodingResponse  

It reads a list of properties (homes and apartments) from HTML files, parses them, geocodes the addresses, and saves the results into an SQLite DB.
Everything works fine except for a very high memory usage (around 800M).
By commenting code out I have pinpointed the problem to be the geocoding step.
I send 100 addresses at a time to MapQuest api (https://developer.mapquest.com/documentation/geocoding-api/batch/get/).
The response for 100 addresses is quite massive, so it might be one of the culprits, but 800M? I feel like it holds on to all of the results until the end, which drives the memory usage so high.

After commenting out the geocoding part of the program memory usage is around 30M which is fine.

You can get the full version which reproduces the issue here: https://github.com/Leonti/haskell-memory-so


I'm quite a newbie in Haskell, so I'm not sure how I can optimize it.
Any ideas?

Cheers!

Leonti
  • I suspect that GC just does not kick in, because you have plenty of RAM available, and not running GC is faster than running it needlessly. It's a pretty common pattern in GC-ed languages. Try limiting the heap available and see if it still fits. – 9000 Jan 28 '17 at 00:54
  • 1
    @9000 That is unlikely to help. The OP is probably right about intermediate results being held for too long. The `mapM` in `propertiesWithGeocoding` is a likely culprit (and, if that's indeed the case, the answers here will likely involve streaming libraries such as *pipes* and *conduit*, which are commonly used to provide alternatives to `mapM` when dealing with large amounts of data). – duplode Jan 28 '17 at 01:47
  • 1
    I wish I had a runnable fragment. The solution may partly be to use a streaming library like `streaming` or `conduit` or `pipes`. But there are other peculiarities. Keep in mind first that aeson does tend to accumulate the whole input in memory for parsing. (This is the only way to do it that would work for all JSON.) There is a `json-stream` library that can get around this in some cases, depending on what you are looking for in the JSON. – Michael Jan 28 '17 at 03:20
  • 1
    Another peculiarity is the way you use `properties` twice over in `propertiesWithGeocoding`, if I'm following. First you run along it to get locations, then you chunk it, then you use `mapM` and `sequence` (these are where a streaming library like `streaming` helps, and it would be nice to have a runnable fragment to show how the refactoring goes). All of this, so far, can be made to 'stream' beautifully without `mapM`, which tends to accumulate the whole list. – Michael Jan 28 '17 at 04:11
  • 2
    Finally though, you `zip` the result with `properties` again, if I follow. This is a problem for a properly streaming (= low memory use) program since it entails that the first time you went through `properties` everything has to be saved since the compiler sees it will need it again. – Michael Jan 28 '17 at 04:16
  • I've created a git repo here: https://github.com/Leonti/haskell-memory-so This one reads a list of addresses (the same number as properties) and geocodes them. Strangely, this one doesn't exhibit any memory issues. I see 8 spikes up to 11M which correspond to 8 calls to MapQuest. Nowhere near 800M. The program without geocoding consumes 30M, the program with just geocoding consumes 11M. If I combine them I get 800M :( – Leonti Jan 28 '17 at 08:29
  • I see two possible problems: 1) you are using `Data.Aeson.decode` instead of the stricter `Data.Aeson.decode'`, which builds less thunks. 2) You seem to be getting the response to the http request as a `String`. `String` is a very wasteful representation in Haskell. It would be more efficient to get the response as a `ByteString` and pipe it into `decode'` without the need for `fromString`. What http library are you using? – danidiaz Jan 28 '17 at 10:37
  • Guys, here is the version which you can run and which has the memory problem: https://github.com/Leonti/haskell-memory-so – Leonti Jan 28 '17 at 11:43
  • @danidiaz I think it might reduce the memory a little, but not by a lot. Version that just geocodes the addresses consumes only 11M at peaks (using `String` and `Data.Aeson.decode`). – Leonti Jan 28 '17 at 11:47
  • When you were shocked by the memory use, I take it you were sending more than 7 or 8 batches of 100? – Michael Jan 28 '17 at 11:49
  • @Michael not really - those are the same addresses, so in the high-consumption case it still processes around 780 addresses (8 batches). I've updated the version on github which reproduces the bug. Could you take a look at it? – Leonti Jan 28 '17 at 11:53
  • I can't run the new version but I made a patch https://github.com/Leonti/haskell-memory-so/pull/1 that helped reduce memory pile-up with the previous version you had on github. I doubt it will have much effect but it will at least be able to get rid of one hypothesis about the problem. – Michael Jan 28 '17 at 13:48
  • One simple modernization like @danidiaz is talking about would be to use http://hackage.haskell.org/package/http-client-0.5.5/docs/Network-HTTP-Client.html `httpLbs` to get a lazy bytestring to feed directly to aeson. – Michael Jan 28 '17 at 13:50
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/134269/discussion-between-michael-and-leonti). – Michael Jan 28 '17 at 18:55
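The `decode'`/`ByteString` suggestion from the comments can be sketched as follows. This is only an illustration: `LatLng` here is a stand-in record with a generic `FromJSON` instance, not the repo's actual `GeocodingResponse` type.

```haskell
{-# LANGUAGE DeriveGeneric #-}
{-# LANGUAGE OverloadedStrings #-}

import Data.Aeson (FromJSON, decode')
import qualified Data.ByteString.Lazy as BL
import GHC.Generics (Generic)

-- Stand-in for the real response type from the repo.
data LatLng = LatLng { lat :: Double, lng :: Double }
    deriving (Show, Eq, Generic)

instance FromJSON LatLng

-- Decode the response as a lazy ByteString: no String round-trip via
-- fromString, and decode' parses more strictly, leaving fewer thunks.
decodeLatLng :: BL.ByteString -> Maybe LatLng
decodeLatLng = decode'

main :: IO ()
main = print (decodeLatLng "{\"lat\":-37.8,\"lng\":144.9}")
```

Getting the HTTP response as a lazy `ByteString` in the first place (e.g. with `httpLbs` from `http-client`, as suggested above) lets it be fed straight to `decode'`.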

1 Answer


It might be worth recording that this turned out to be a simple streaming problem arising from the use of `mapM` and `sequence`, which, together with `replicateM`, `traverse`, and other things that make you "extract a list from IO", always raise accumulation worries. So a little detour through a streaming library was needed. In the repo it was necessary just to replace

processDate :: String -> IO ()
processDate date = do
    allFiles <- listFiles date
    allProperties <- mapM fileToProperties allFiles
    let flattenedPropertiesWithPrice = filter hasPrice $ concat allProperties
    geocodedProperties <- propertiesWithGeocoding flattenedPropertiesWithPrice
    print geocodedProperties

propertiesWithGeocoding :: [ParsedProperty] -> IO [(ParsedProperty, Maybe LatLng)]
propertiesWithGeocoding properties = do
    let batchProperties = chunksOf 100 properties
    batchGeocodedLocations <- mapM geocodeAddresses batchProperties
    let geocodedLocations = fromJust $ concat <$> sequence batchGeocodedLocations
    return geocodedLocations

with something like this:

import Streaming
import qualified Streaming.Prelude as S

processDate :: String -> IO ()
processDate date = do
    allFiles <- listFiles date   -- we accept an unstreamed list
    S.print $ propertiesWithGeocoding -- this was the main pain point; see below
            $ S.filter hasPrice 
            $ S.concat 
            $ S.mapM fileToProperties -- this mapM doesn't accumulate
            $ S.each allFiles    -- the list is converted to a stream

propertiesWithGeocoding
  :: Stream (Of ParsedProperty) IO r
     -> Stream (Of (ParsedProperty, Maybe LatLng)) IO r
propertiesWithGeocoding properties =  
    S.concat $ S.concat 
             $ S.mapM geocodeAddresses -- this mapM doesn't accumulate results from mapquest
             $ S.mapped S.toList       -- convert segments to haskell lists
             $ chunksOf 100 properties -- this is the streaming `chunksOf`
    -- concat here flattens a stream of lists of as into a stream of as
    -- and a stream of maybe as into a stream of as

Then the memory use looks like this: each peak corresponds to a trip to MapQuest, promptly followed by a little processing and a print, whereupon GHC forgets all about it and moves on:

Of course this could be done with `pipes` or `conduit`, but here we just need a little bit of simple `mapM`/`sequence`/`traverse`/`replicateM` avoidance, and `streaming` is perhaps simplest for this sort of quick local refactoring. Note that this list is quite short, so the thought "but short lists are cool with `mapM`/`traverse`/etc.!" can be quite spectacularly false. Why not just get rid of them?

Whenever you are about to write `mapM f list`, it is a good idea to consider `S.mapM f . S.each` (or the `conduit` or `pipes` equivalent). You will now have a stream and can recover a list with `S.toList` or an equivalent, but it is likely that, as in this case, you will find you don't need a reified accumulated list, but can instead use some streaming process like printing to a file or stdout, or writing things to a database, after making whatever list-like manipulations are needed (here we use, e.g., the streaming `filter`, and also `concat` both to flatten streamed lists and as a sort of `catMaybes`).
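The process-as-you-go principle can also be shown without any extra packages. The following is just a sketch of the idea, not the repo's code: each batch's results are handed to a sink as soon as they arrive, so no result list is ever accumulated (`chunksOf` is inlined so the fragment is self-contained).

```haskell
-- chunksOf as in Data.List.Split, inlined to keep the sketch base-only
chunksOf :: Int -> [a] -> [[a]]
chunksOf _ [] = []
chunksOf n xs = take n xs : chunksOf n (drop n xs)

-- Run a batch call on n items at a time, feeding each result to the
-- sink immediately instead of collecting everything with mapM first.
forEachBatch :: Int -> ([a] -> IO [b]) -> (b -> IO ()) -> [a] -> IO ()
forEachBatch n call sink xs =
    mapM_ (\batch -> call batch >>= mapM_ sink) (chunksOf n xs)

main :: IO ()
main = forEachBatch 2 (pure . map (* 10)) print [1 .. 5 :: Int]
```

`S.mapM` in `streaming` does essentially this, while also letting the stream be filtered, concatenated, and otherwise manipulated before it is consumed.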

Michael