
So I am trying to break a corpus of 40,000 articles down into tf-idf weights for every word in each article. I have about 300MB of reviews. However, when I try to analyze even a small subset of these reviews (~1000), memory consumption grows at an extraordinary rate: it takes about 600MB to tf-idfize 1000 reviews. This is unacceptable.
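
For reference, the weight I am ultimately after is just the standard tf-idf score. This is only a sketch of the formula, not code from my project, and it assumes I already have a term's count within one review, the number of reviews containing that term, and the total number of reviews:

-- Sketch only: standard tf-idf weight for one term in one review.
-- termCount:    occurrences of the term in this review
-- docsWithTerm: number of reviews containing the term
-- totalDocs:    total number of reviews in the corpus
tfidf :: Int -> Int -> Int -> Double
tfidf termCount docsWithTerm totalDocs =
    fromIntegral termCount * log (fromIntegral totalDocs / fromIntegral docsWithTerm)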

A heap analysis shows, as expected, that all the memory (~550MB) is being allocated for ByteStrings. This seems high, considering that the first 1000 reviews only comprise 50MB. Additionally, I am not even retaining the full-text bodies of the reviews. I've tried adding strictness annotations (which usually fix this kind of problem), but they have helped very little. I have also tried a linear hash table instead of a basic hash table, but the performance was the same.

I suspect that there is some problem with how the foldM is being reduced. Most of the time/allocation is spent around the extractReview logic, but I can't see any obvious offenders.
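
In case it is relevant, I know foldM is not strict in its accumulator. A strict variant I could swap in would look roughly like the following (foldM' is just an illustrative name, not something I am actually using yet):

-- Sketch of a monadic left fold that forces the accumulator at every step.
foldM' :: Monad m => (a -> b -> m a) -> a -> [b] -> m a
foldM' _ acc []     = return acc
foldM' f acc (x:xs) = do acc' <- f acc x
                         acc' `seq` foldM' f acc' xs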

Any help would be appreciated.

The relevant code (with some helper functions omitted):

processReview :: Int -> [Review] -> String -> IO [Review]
processReview n stack file = do !raw <- B.readFile file
                                !newr <- extractReview n raw
                                return $ newr : stack

extractReview :: Int -> B.ByteString -> IO Review
extractReview n  r = do  !new_ngrams <- count_ngrams n body
                         return $ Review {ngrams = new_ngrams, url = safeNode url, isbns = map strContent isbns} 
                     where (Just !elem) = parseXMLDoc r
                           !body = cleanUTF8 $ B8.pack $ safeNode $ findElement (QName "body" Nothing Nothing) elem
                           !isbns = findElements (QName "isbn" Nothing Nothing) elem
                           !url = findElement (QName "url" Nothing Nothing) elem
                           safeNode = maybe "" (\m -> strContent m)

count_ngrams :: Int -> BL.ByteString -> IO Ngrams
count_ngrams n rbody = do !new_list <- H.new
                          !ngrams <- foldM (\h w -> let !w' = lowercase w
                                                    in if elem w' ignore_words
                                                          then return h
                                                          else increment_ngram 1 h w')
                                           new_list word_list
                          return ngrams
                        where !just_words = BL.filter (\c -> c == 32 || (c >= 65 && c <= 90) || (c >= 97 && c <= 122)) (rbody)
                              !word_list = BL.split 32 just_words

increment_ngram :: Int -> Ngrams -> BL.ByteString -> IO Ngrams
increment_ngram amount ns word = do count <- H.lookup ns word
                                    case count of
                                         (Just i) -> H.insert ns word (i + amount)
                                         Nothing -> H.insert ns word amount
                                    return ns

sumNgrams :: [Review] -> IO Ngrams
sumNgrams reviews = do dict <- H.new
                       mapM_ (\r -> H.mapM_ (\(k,v) -> increment_ngram 1 dict k) (ngrams r)) reviews 
                       return dict                        


main = do
       [n] <- getArgs
       ngrams <- H.new :: IO (H.BasicHashTable Review Ngrams)
       reviews <- fmap (map (\c -> "./reviews/" ++ c) . filter (isInfixOf "xml") . take 500) $ getDirectoryContents "./reviews"
       analyzed_reviews <- foldM (\stack r -> processReview (read n) stack r) [] reviews
Erik Hinton
  • Your `main` function is not complete. What do you do with `analyzed_reviews`? – Tom Ellis Sep 20 '13 at 19:26
  • Your use of `foldM` in `main` is redundant. You're just using it to do a `mapM`, since you don't actually use `stack`. – Tom Ellis Sep 20 '13 at 19:45
  • Sorry, I just truncated it after that line because the analysis showed that the relevant cost centres were above it. Even if I delete the rest of main after that line and just `return ()`, I have the same memory issues. – Erik Hinton Sep 20 '13 at 19:45
  • Yes, I was originally mapM-ing but I transformed it into a foldM just to make sure the mapM wasn't abstracting anything that was messing up the fold reductions. – Erik Hinton Sep 20 '13 at 19:46
  • Do you know that `foldM` is not strict in its accumulator? You might like to try writing a strict version. – Tom Ellis Sep 20 '13 at 19:47
  • Why are you passing `n` around? You don't use it. Could you replace your code with code that doesn't have any errors with `-Wall`? That would make it much easier to read. – Tom Ellis Sep 20 '13 at 19:50
  • In any case, my first attempt at a fix would be to use a version of `foldM` that is strict in its accumulator, and in fact just use `Data.Map` rather than a hashtable in `IO` since the latter is likely to confuse the analysis. – Tom Ellis Sep 20 '13 at 19:51
  • Ah, yes, I am in the middle of writing support for 2 and 3-grams. I just stripped out those features to focus on the space issues. Also, as far as foldM's strictness, http://stackoverflow.com/questions/8919026/does-haskell-have-foldlm seems to suggest that foldM' might be redundant. Could be something to try? – Erik Hinton Sep 20 '13 at 19:54
  • let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/37754/discussion-between-erik-hinton-and-tom-ellis) – Erik Hinton Sep 20 '13 at 19:58
  • Doesn't tf-idf require that you keep EVERY SINGLE WORD in order to check the corpus? So usage should be the articles plus a hash of all the words from the articles. Not something you probably want to keep in memory. – Xeoncross Nov 04 '14 at 16:40
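
For reference, here is a rough sketch of the pure Data.Map plus strict fold approach Tom Ellis suggests above, as a point of comparison with count_ngrams. The names and details are assumptions rather than code from the question, and the lowercasing and stop-word filtering are omitted for brevity:

import qualified Data.ByteString.Lazy as BL
import qualified Data.Map.Strict as M
import Data.List (foldl')

-- Count word occurrences with a pure, strict Map instead of a hashtable in IO.
countNgramsMap :: BL.ByteString -> M.Map BL.ByteString Int
countNgramsMap rbody = foldl' step M.empty word_list
  where just_words = BL.filter (\c -> c == 32 || (c >= 65 && c <= 90) || (c >= 97 && c <= 122)) rbody
        word_list  = BL.split 32 just_words
        step m w   = M.insertWith (+) w 1 m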

0 Answers