I've got a 279MB file that contains ~10 million key/value pairs, with ~500,000 unique keys. It's grouped by key (each key only needs to be written once), so all the values for a given key are together.
What I want to do is transpose the association, create a file where the pairs are grouped by value, and all the keys for a given value are stored together.
Currently, I'm using Parsec to read in the pairs as a list of tuples (K,[V])
(using lazy IO so I can process it as a stream while Parsec is processing the input file), where:
newtype K = K Text deriving (Show, Eq, Ord, Hashable)
newtype V = V Text deriving (Show, Eq, Ord, Hashable)
tupleParser :: Parser (K,[V])
tupleParser = ...
data ErrList e a = Cons a (ErrList e a) | End | Err e
parseAllFromFile :: Parser a -> FilePath-> IO (ErrList ParseError a)
parseAllFromFile parser inputFile = do
contents <- readFile inputFile
let Right initialState = parse getParserState inputFile contents
return $ loop initialState
where loop state = case unconsume $ runParsecT parser' state of
Error err -> Err err
Ok Nothing _ _ -> End
Ok (Just a) state' _ -> a `Cons` loop state'
unconsume v = runIdentity $ case runIdentity v of
Consumed ma -> ma
Empty ma -> ma
parser' = (Just <$> parser) <|> (const Nothing <$> eof)
I've tried to insert the tuples into a Data.HashMap.Map V [K]
to transpose the association:
transpose :: ErrList ParseError (K,[V]) -> Either ParseError [(V,[K])]
transpose = transpose' M.empty
where transpose' _ (Err e) = Left e
transpose' m End = Right $ assocs m
transpose' m (Cons (k,vs) xs) = transpose' (L.foldl' (include k) m vs) xs
include k m v = M.insertWith (const (k:)) v [k] m
But when I tried it, I got the error:
memory allocation failed (requested 2097152 bytes)
I could think of a couple things I'm doing wrong:
- 2MB seems a bit low (considerably less than the 2GB RAM my machine has installed), so maybe I need to tell GHC it's ok to use more?
- My problems could be algorithmic/data structure related. Maybe I'm using the wrong tools for the job?
- My attempt to use lazy IO could be coming back to bite me.
I'm leaning toward (1) for now, but I'm not sure by any means.