So I am trying to break down a corpus of 40,000 articles into the tf-idf weights for every word in each article. I have about 300MB of reviews. However, when I try to analyze even a small subset of these reviews (~1000), memory consumption grows extraordinarily: it takes about 600MB to tf-idfize 1000 reviews. This is unacceptable.
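(For reference, the weight I'm after for each term t in review d is the standard tf-idf: tf(t, d) * log(N / df(t)), where N is the total number of reviews and df(t) is the number of reviews containing t.)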
A heap analysis shows, as expected, that almost all of the memory (~550MB) is being allocated for ByteStrings. This seems high, considering that the first 1000 reviews comprise only 50MB, and that I am not even retaining the full-text bodies of the reviews. I've tried adding strictness annotations (which usually fixes this kind of problem), but they have helped little. I have also tried a linear hash table instead of a basic hash table, but the performance was the same.
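For what it's worth, the heap figures above came from a profiled build, built with roughly ghc -O2 -prof -fprof-auto and run with +RTS -hy to break residency down by type, then hp2ps on the resulting .hp file (flags reconstructed from memory, so treat them as approximate).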
I suspect that there is some problem with the reduction of the foldM: most of the time/alloc is spent in and around the extractReview logic, but I can't see any obvious offenders.
Any help would be appreciated.
The relevant code (with some helper functions omitted):
{-# LANGUAGE BangPatterns #-}

import Control.Monad (foldM)
import Data.List (isInfixOf)
import qualified Data.ByteString as B
import qualified Data.ByteString.Char8 as B8
import qualified Data.ByteString.Lazy as BL
import qualified Data.HashTable.IO as H
import System.Directory (getDirectoryContents)
import System.Environment (getArgs)
import Text.XML.Light

-- Review, Ngrams, cleanUTF8, lowercase and ignore_words are defined
-- elsewhere (omitted).

-- Read one review file strictly and cons the parsed result onto the stack.
processReview :: Int -> [Review] -> String -> IO [Review]
processReview n stack file = do
    !raw <- B.readFile file
    !newr <- extractReview n raw
    return (newr : stack)
-- Parse one review's XML and build a Review with its n-gram counts.
extractReview :: Int -> B.ByteString -> IO Review
extractReview n r = do
    !new_ngrams <- count_ngrams n body
    return Review { ngrams = new_ngrams
                  , url    = safeNode url
                  , isbns  = map strContent isbns
                  }
  where
    (Just !elem) = parseXMLDoc r
    !body  = cleanUTF8 $ B8.pack $ safeNode $ findElement (QName "body" Nothing Nothing) elem
    !isbns = findElements (QName "isbn" Nothing Nothing) elem
    !url   = findElement (QName "url" Nothing Nothing) elem
    safeNode = maybe "" strContent
-- Count the words in a review body, skipping stop words.
count_ngrams :: Int -> BL.ByteString -> IO Ngrams
count_ngrams n rbody = do
    !new_list <- H.new
    !ngrams <- foldM (\h w -> let !w' = lowercase w
                              in if w' `elem` ignore_words
                                   then return h
                                   else increment_ngram 1 h w')
                     new_list word_list
    return ngrams
  where
    -- keep only spaces and ASCII letters, then split on spaces
    !just_words = BL.filter (\c -> c == 32 || (c >= 65 && c <= 90) || (c >= 97 && c <= 122)) rbody
    !word_list  = BL.split 32 just_words
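-- One suspicion, sketched here but not yet wired in: the words produced
-- by BL.split share their buffers with the review they were sliced from,
-- so using them as hash-table keys may keep each review's full text
-- alive. Copying every key into its own buffer would break that sharing
-- (assumes bytestring >= 0.10 for toStrict/fromStrict):
unshare :: BL.ByteString -> BL.ByteString
unshare = BL.fromStrict . B.copy . BL.toStrict
-- ...which would be used in the fold above as:
--     increment_ngram 1 h (unshare w')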
-- Bump a word's count; the table is mutated in place, so the returned
-- handle is the same one that was passed in.
increment_ngram :: Int -> Ngrams -> BL.ByteString -> IO Ngrams
increment_ngram amount ns word = do
    count <- H.lookup ns word
    case count of
        Just i  -> H.insert ns word (i + amount)
        Nothing -> H.insert ns word amount
    return ns
-- Document frequencies: for each review, bump every distinct word once.
sumNgrams :: [Review] -> IO Ngrams
sumNgrams reviews = do
    dict <- H.new
    mapM_ (\r -> H.mapM_ (\(k, _v) -> increment_ngram 1 dict k) (ngrams r)) reviews
    return dict
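-- For context (hypothetical helper, not used above): once sumNgrams has
-- produced the document frequencies, the weighting step itself is cheap.
-- tf-idf for one term, given its term frequency, document frequency,
-- and the total number of reviews n:
tfidf :: Int -> Int -> Int -> Double
tfidf tf df n = fromIntegral tf * log (fromIntegral n / fromIntegral df)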
main = do
    [n] <- getArgs
    ngrams <- H.new :: IO (H.BasicHashTable Review Ngrams)
    reviews <- fmap (map ("./reviews/" ++) . filter (isInfixOf "xml") . take 500)
                    (getDirectoryContents "./reviews")
    analyzed_reviews <- foldM (processReview (read n)) [] reviews
    -- ... (rest of main omitted)